10X单细胞（10X空间转录组）BCR（TCR）数据分析之(9)changeo

hello，大家好，前面分享了很多有关TCR的数据分析，今天我们来继续这个专题,今天的分享都是为后续最终的目的做准备，现在还都只是中间过程，好了，我们今天我分享的软件是changeo，大家可以参考文章Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data，2015年便发表了，而我们分享这个的目的在于集成这个软件的算法，达到我们最后这个专题的目的，这个软件主要关注在BCR的数据分析，我们来看看这个软件的算法和原理。

背景

高通量测序技术的进步现在允许对 B 细胞受体 (BCR) 和 T 细胞受体 (TCR) 库进行大规模表征。适应性免疫受体库 (AIRR) 的高种系和体细胞多样性对具有生物学意义的分析提出了挑战——需要开发专门的计算方法。

Imcantation 框架为高通量 AIRR-seq 数据集提供了一个从头到尾的分析生态系统。从原始读取开始，提供 Python 和 R 包用于预处理、群体结构确定和repertoire分析。（这个我们当然喜闻乐见）。

核心软件（软件非常的多，这个专题我们也要持续很久）。

图片.png

Contributed Packages

图片.png

当然，时至今日，我们只关心软件对于10X数据的分析（限于个人背景），我们来看看软件的处理和结果。

1、Converting 10X V(D)J data into the AIRR Community standardized format

To process 10X V(D)J data, a combination of AssignGenes.py and MakeDb.py(这是两个集成的分析模块，作者已经封装好，这会儿我们只需要知道其作用，后续会做详细的分享) can be used to generate a TSV file compliant with the AIRR Community Rearrangement（当然这里又是一个集成的算法，我们也需要了解，后续我们详细的分享，先了解一下概念，重排是描述重排的适应性免疫受体链（例如，抗体重链或 TCR β 链）以及大量注释的序列。这些注释由 AIRR Rearrangement 模式定义，包括八个类别。） schema that incorporates annotation information provided by the Cell Ranger pipeline. The `--10x filtered_contig_annotations.csv` specifies the path of the contig annotations file generated by `cellranger vdj`, which can be found in the `outs` directory.免疫真的很难，需要耐心和时间。

Generate AIRR Rearrangement data from the 10X V(D)J FASTA files using the steps below:

AssignGenes.py igblast -s filtered_contig.fasta -b ~/share/igblast \
   --organism human --loci ig --format blast
MakeDb.py igblast -i filtered_contig_igblast.fmt7 -s filtered_contig.fasta \
   -r IMGT_Human_*.fasta --10x filtered_contig_annotations.csv --extended

注：all_contig.fasta can be exchanged for filtered_contig.fasta, and all_contig_annotations.csv can be exchanged for filtered_contig_annotations.csv.

一些参数的转换

To process mouse data and/or TCR data alter the --organism and --loci arguments to AssignGenes.py accordingly (e.g., --organism mouse, --loci tcr) and use the appropriate V, D and J IMGT reference databases (e.g., IMGT_Mouse_TR*.fasta)

2、 Identifying clones from B cells in AIRR formatted 10X V(D)J data

Splitting into separate light and heavy chain files（这个地方跟之前TCR的处理类似，但是方式不一样）。

要将 B 细胞从 AIRR 重排数据分组为克隆，必须将 MakeDb.py 的输出解析为轻链文件和重链文件：

ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IGH" \
        --logic all --regex --outname heavy
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IG[LK]" \
        --logic all --regex --outname light

接下来就是要进行数据的过滤了，这个过滤和我们的普通转录组差别巨大。

The ParseDb.py（又是一个集成模块） tool provides a basic set of operations for manipulating Change-O database files from the commandline, including removing or updating rows and columns.

Removing non-productive sequences

After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:(挑选有用的数据)

ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive

上面的第一行使用 select 子命令输出标记为 parse-select 的单个文件，该文件仅包含`productive`列 (-f productive) 中值为 T (-u T) 的记录。

或者，上面的第二行使用 split 子命令输出多个文件，每个文件包含具有`productive`列中找到的值之一的记录 (-f productive)。这将生成标记为`productive-T`和`productive-F`的两个文件。

消除 C 区引物和参考比对之间的分歧

If you have data that includes both heavy and light chains in the same library, the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not always agree with the isotype assignments from the C-region primers. In these cases, you can filter out such reads with the select subcommand of ParseDb.py. An example function call using an imaginary file `db.tsv` is provided below:

ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
    --logic all --regex --outname heavy
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
    --logic all --regex --outname light

这些命令将要求所有 v_call、j_call 和 c_call 区域（-f v_call j_call c_call 和 --logic all）包含字符串 IGH（第 1-2 行）或 IGK 或 IGL（第 3-4 行）之一。 `--regex`参数允许对正则表达式进行部分匹配和解释。这两个命令的输出是两个文件，一个只包含重链（heavy_parse-select.tsv），一个只包含轻链（light_parse-select.tsv）。

Exporting records to FASTA files

You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.

Standard FASTA

ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
    --sf sequence_alignment --mf v_call duplicate_count

Where the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_id, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count.(我们一般也就需要转换成fasta)

Assign clonal groups to the heavy chain data（最为关键的聚类）

The heavy chain file must then be clonally clustered separately.(重链需要单独聚类，那到底如何做呢？)

Clustering sequences into clonal groups

首先第一步：Determining a clustering threshold

Before running DefineClones.py（集成模块，我们暂时先知道是做聚类的）, it is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones.（看来这里的聚类是层次聚类）。The distToNearest function in the SHazaM R package calculates the distance between each sequence in the data and its nearest-neighbor.（利用了R包SHazaM来计算序列的distance，这里就像是之前分享的利用TCRdist计算TCR之间的序列距离）,结果分布应该是双峰的，第一种模式表示数据集中具有克隆亲属的序列，第二种模式表示singletons(这个单词大家意会一下)。The ideal threshold for separating clonal groups is the value that separates the two modes of this distribution and can be found using the findThreshold function in the SHazaM R package.（阈值的选择既有人工也有软件识别）,The distToNearest function allows selection of all parameters that are available in DefineClones.py. Using the length normalization parameter ensures that mutations are weighted equally regardless of junction sequence length.The distance to nearest-neighbor distribution for the example data is shown below. The threshold is approximately 0.16 - indicated by the red dotted line.(看来最好还是人工选择，当然关于阈值的划分我们还会再分享一篇文章)。

图片.png

第二步、克隆聚类（主要针对B细胞）

There are several parameter choices when grouping Ig sequences into B cell clones. The argument `--act` set accounts for ambiguous V gene and J gene calls when grouping similar sequences. The distance metric `--model ham` is nucleotide Hamming distance. Because the threshold was generated using length normalized distances, the `--norm len` argument is selected with the previously determined threshold --dist 0.16:

DefineClones.py -d HD13M_db-pass.tsv --act set --model ham \
    --norm len --dist 0.16

Correct clonal groups based on light chain data

DefineClones.py currently does not support light chain cloning. However, cloning can be performed after heavy chain cloning using light_cluster.py provided on the Immcantation Bitbucket repository in the `scripts` directory:

light_cluster.py -d heavy_select-pass_clone-pass.tsv -e light_select-pass.tsv \
        -o 10X_clone-pass.tsv

Here, `heavy_select-pass_clone-pass.tsv` refers to the cloned heavy chain AIRR Rearrangement file, `light_select-pass.tsv` refers to the light chain file, and `10X_clone-pass.tsv` is the resulting output file.

算法优势

remove cells associated with more than one heavy chain
correct heavy chain clone definitions based on an analysis of the light chain partners associated with the heavy chain clone.（相互辅助的结果）

图片.png

接下来就是Reconstructing germline sequences（重建种系序列）

Adding germline sequences to the database

The CreateGermlines.py tool is used to reconstruct the germline V(D)J sequence, from which the Ig lineage and mutations can be inferred（推断ig的谱系和突变）。 In addition to the alignment information parsed by MakeDb.py to generate the initial database, CreateGermlines.py also requires the set of germline sequences that were used for the alignment passed to the `-r` argument. In the case of V-segment germlines, the reference sequences must be IMGT-gapped. Because the D-segment call for B cell receptor alignments is often low confidence, the default germline format (-g dmask) places Ns in the N/P and D-segments of the junction region rather than using the D-segment assigned during reference alignment; this can be modified to generate a complete germline (-g full) or a V-segment only germline (-g vonly) if you wish. The command below adds the germline sequence to the germline_alignment_d_mask column of the output database:（这一部分应该是可选的，但是谱系示踪确实非常重要）

CreateGermlines.py -d HD13M_db-pass.tsv -g dmask \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta

Alternatively, if you have run the clonal assignment task prior to invoking CreateGermlines.py, then adding the `--cloned` argument is recommended, as this will generate a single germline of consensus length for each clone:(当然，会涉及到很多其他软件的适用，我们慢慢来，欲速则不达)。

CreateGermlines.py -d HD13M_db-pass_clone-pass.tsv -g dmask --cloned \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta

图片.png

IgPhyML lineage tree analysis（这也是这个软件的最终目的）

IgPhyML 是一个程序，旨在构建系统发育树并测试有关 B 细胞亲和力成熟的进化假设。

B 细胞体细胞超突变 (SHM) 的生物学违反了大多数标准系统发育替代模型中的重要假设；此外，虽然大多数系统发育程序旨在分析单个谱系，但 B 细胞库通常包含数千个谱系。 IgPhyML 通过实施替代模型来解决这两个问题，这些模型纠正了 SHM 的上下文敏感性质，并结合了来自多个谱系的信息，以提供更精确估计的全库模型参数估计。

An in-depth description of IgPhyML installation and usage can be found at the IgPhyML website.（这个软件我们也需要深入分析和分享了，真的太多了）.

分析开始

Once installed, IgPhyML can be run through BuildTrees by specifying the `--igphyml` option. IgPhyML is easiest to run through the Immcantation Docker image. If this is not possible, these instructions require Change-O 0.4.6 or higher, Alakazam 0.3.0 or higher, and IgPhyML to be installed, with the executable in your `PATH` variable.

The following commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what’s going on or make sure what you’re doing makes sense, please check out the IgPhyML website.（后续分享这个软件）。

Build trees and estimate model parameters

# Clone IgPhyML repository to get example files
git clone https://bitbucket.org/kleinstein/igphyml

# Move to examples directory
cd igphyml/examples

# Run BuildTrees
BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse \
    --sample 3000 --igphyml --clean all --nproc 1

此命令处理 BCR 序列的 AIRR 格式数据集，该数据集已与重建的生殖系进行克隆聚类。然后使用 GY94 模型快速构建树，并使用这些固定拓扑估计 HLP19 模型参数。这可以通过增加 --nproc 选项来加速。使用 --sample 选项进行子采样并不是绝对必要的，但 IgPhyML 在应用于大型数据集时会运行缓慢。在这里， --collapse 标志用于折叠相同的序列。强烈建议这样做，因为相同的序列会减慢计算速度，而不会影响 IgPhyML 中的似然值。

可视化

可以使用 Alakazam 的 readIgphyml 函数读取上述命令的输出文件（又是一个集成模块）。在示例子文件夹中打开 R 会话后，输入以下命令。请注意，使用 Docker 容器时，您需要在绘制树后运行 dev.off() 以在示例目录中创建 pdf 绘图：

library(alakazam)
library(igraph)

db = readIgphyml("ex_igphyml-pass.tab")

# Plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)

# Show HLP10 parameters
print(t(db$param[1,]))
CLONE         "REPERTOIRE"
NSEQ          "4"
NSITE         "107"
TREE_LENGTH   "0.286"
LHOOD         "-290.7928"
KAPPA_MLE     "2.266"
OMEGA_FWR_MLE "0.5284"
OMEGA_CDR_MLE "2.3324"
WRC_2_MLE     "4.8019"
GYW_0_MLE     "3.4464"
WA_1_MLE      "5.972"
TW_0_MLE      "0.8131"
SYC_2_MLE     "-0.99"
GRS_0_MLE     "0.2583"

图片.png

To visualize a larger dataset with bigger trees, and bifurcating tree topologies, again open an R session in the examples directory:

library(alakazam)
library(ape)

db = readIgphyml("sample1_igphyml-pass.tab",format="phylo")

# Plot largest lineage tree
plot(ladderize(db$trees[[1]]),cex=0.7,no.margin=TRUE)

图片.png

真的是又多又难，生活很好，有你更好

禁止转载，如需转载请通过简信或评论联系作者。

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 205,132评论 6赞 478
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 87,802评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 151,566评论 0赞 338
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,858评论 1赞 277
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,867评论 5赞 368
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,695评论 1赞 282
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,064评论 3赞 399
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,705评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 42,915评论 1赞 300
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,677评论 2赞 323
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,796评论 1赞 333
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,432评论 4赞 322
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,041评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,992评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,223评论 1赞 260
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,185评论 2赞 352
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,535评论 2赞 343

10X单细胞（10X空间转录组）BCR（TCR）数据分析之(9)changeo

背景

高通量测序技术的进步现在允许对 B 细胞受体 (BCR) 和 T 细胞受体 (TCR) 库进行大规模表征。 适应性免疫受体库 (AIRR) 的高种系和体细胞多样性对具有生物学意义的分析提出了挑战——需要开发专门的计算方法。

Imcantation 框架为高通量 AIRR-seq 数据集提供了一个从头到尾的分析生态系统。 从原始读取开始，提供 Python 和 R 包用于预处理、群体结构确定和repertoire分析。 （这个我们当然喜闻乐见）。

核心软件（软件非常的多，这个专题我们也要持续很久）。

Contributed Packages

当然，时至今日，我们只关心软件对于10X数据的分析（限于个人背景），我们来看看软件的处理和结果。

1、Converting 10X V(D)J data into the AIRR Community standardized format

Generate AIRR Rearrangement data from the 10X V(D)J FASTA files using the steps below:

一些参数的转换

2、 Identifying clones from B cells in AIRR formatted 10X V(D)J data

Splitting into separate light and heavy chain files（这个地方跟之前TCR的处理类似，但是方式不一样）。

要将 B 细胞从 AIRR 重排数据分组为克隆，必须将 MakeDb.py 的输出解析为轻链文件和重链文件：

接下来就是要进行数据的过滤了，这个过滤和我们的普通转录组差别巨大。

The ParseDb.py（又是一个集成模块） tool provides a basic set of operations for manipulating Change-O database files from the commandline, including removing or updating rows and columns.

Removing non-productive sequences

After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:(挑选有用的数据)

上面的第一行使用 select 子命令输出标记为 parse-select 的单个文件，该文件仅包含productive列 (-f productive) 中值为 T (-u T) 的记录。

或者，上面的第二行使用 split 子命令输出多个文件，每个文件包含具有productive列中找到的值之一的记录 (-f productive)。 这将生成标记为productive-T和productive-F的两个文件。

消除 C 区引物和参考比对之间的分歧

Exporting records to FASTA files

You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.

Standard FASTA

Where the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_id, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count.(我们一般也就需要转换成fasta)

Assign clonal groups to the heavy chain data（最为关键的聚类）

The heavy chain file must then be clonally clustered separately.(重链需要单独聚类，那到底如何做呢？)

Clustering sequences into clonal groups

首先第一步：Determining a clustering threshold

第二步、克隆聚类（主要针对B细胞）

Correct clonal groups based on light chain data

DefineClones.py currently does not support light chain cloning. However, cloning can be performed after heavy chain cloning using light_cluster.py provided on the Immcantation Bitbucket repository in the scripts directory:

Here, heavy_select-pass_clone-pass.tsv refers to the cloned heavy chain AIRR Rearrangement file, light_select-pass.tsv refers to the light chain file, and 10X_clone-pass.tsv is the resulting output file.

算法优势

接下来就是Reconstructing germline sequences（重建种系序列）

Adding germline sequences to the database

IgPhyML lineage tree analysis（这也是这个软件的最终目的）

IgPhyML 是一个程序，旨在构建系统发育树并测试有关 B 细胞亲和力成熟的进化假设。

An in-depth description of IgPhyML installation and usage can be found at the IgPhyML website.（这个软件我们也需要深入分析和分享了，真的太多了）.

分析开始

The following commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what’s going on or make sure what you’re doing makes sense, please check out the IgPhyML website.（后续分享这个软件）。

Build trees and estimate model parameters

可视化

可以使用 Alakazam 的 readIgphyml 函数读取上述命令的输出文件（又是一个集成模块）。 在示例子文件夹中打开 R 会话后，输入以下命令。 请注意，使用 Docker 容器时，您需要在绘制树后运行 dev.off() 以在示例目录中创建 pdf 绘图：

To visualize a larger dataset with bigger trees, and bifurcating tree topologies, again open an R session in the examples directory:

推荐阅读更多精彩内容

高通量测序技术的进步现在允许对 B 细胞受体 (BCR) 和 T 细胞受体 (TCR) 库进行大规模表征。适应性免疫受体库 (AIRR) 的高种系和体细胞多样性对具有生物学意义的分析提出了挑战——需要开发专门的计算方法。

Imcantation 框架为高通量 AIRR-seq 数据集提供了一个从头到尾的分析生态系统。从原始读取开始，提供 Python 和 R 包用于预处理、群体结构确定和repertoire分析。（这个我们当然喜闻乐见）。

上面的第一行使用 select 子命令输出标记为 parse-select 的单个文件，该文件仅包含`productive`列 (-f productive) 中值为 T (-u T) 的记录。

或者，上面的第二行使用 split 子命令输出多个文件，每个文件包含具有`productive`列中找到的值之一的记录 (-f productive)。这将生成标记为`productive-T`和`productive-F`的两个文件。

DefineClones.py currently does not support light chain cloning. However, cloning can be performed after heavy chain cloning using light_cluster.py provided on the Immcantation Bitbucket repository in the `scripts` directory:

Here, `heavy_select-pass_clone-pass.tsv` refers to the cloned heavy chain AIRR Rearrangement file, `light_select-pass.tsv` refers to the light chain file, and `10X_clone-pass.tsv` is the resulting output file.

可以使用 Alakazam 的 readIgphyml 函数读取上述命令的输出文件（又是一个集成模块）。在示例子文件夹中打开 R 会话后，输入以下命令。请注意，使用 Docker 容器时，您需要在绘制树后运行 dev.off() 以在示例目录中创建 pdf 绘图：