Hello everyone. We've already shared a lot on TCR data analysis; today we continue the series. Everything in today's post is preparation for the final goal of this series, so it's still an intermediate step. The software I'm covering today is Change-O; you can refer to the paper "Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data", published back in 2015. Our purpose in covering it is to integrate this software's algorithms toward the ultimate goal of the series. Change-O focuses mainly on BCR data analysis, so let's look at its algorithms and principles.
Background
Advances in high-throughput sequencing now allow large-scale characterization of B cell receptor (BCR) and T cell receptor (TCR) repertoires. The high germline and somatic diversity of adaptive immune receptor repertoires (AIRR) presents challenges for biologically meaningful analysis and requires the development of specialized computational methods.
The Immcantation framework provides a start-to-finish analysis ecosystem for high-throughput AIRR-seq datasets. Starting from raw reads, Python and R packages are provided for pre-processing, population structure determination, and repertoire analysis. (Which we are of course happy to see.)
Core software (there are a great many packages; this series is going to run for quite a while).
Contributed Packages
Of course, these days we only care about how the software handles 10X data (given my own background), so let's look at the processing and the results.
1、Converting 10X V(D)J data into the AIRR Community standardized format
To process 10X V(D)J data, a combination of AssignGenes.py and MakeDb.py (two packaged analysis modules; for now we only need to know what they do, and we'll cover them in detail later) can be used to generate a TSV file compliant with the AIRR Community Rearrangement schema (another packaged algorithm we need to understand; we'll cover it in detail later, but for now the concept: a Rearrangement describes a rearranged adaptive immune receptor chain, e.g., an antibody heavy chain or TCR beta chain, along with a host of annotations. These annotations are defined by the AIRR Rearrangement schema and fall into eight categories.) that incorporates annotation information provided by the Cell Ranger pipeline. The --10x filtered_contig_annotations.csv argument specifies the path of the contig annotations file generated by cellranger vdj, which can be found in the outs directory. Immunology really is hard; it takes patience and time.
Generate AIRR Rearrangement data from the 10X V(D)J FASTA files using the steps below:
AssignGenes.py igblast -s filtered_contig.fasta -b ~/share/igblast \
--organism human --loci ig --format blast
MakeDb.py igblast -i filtered_contig_igblast.fmt7 -s filtered_contig.fasta \
-r IMGT_Human_*.fasta --10x filtered_contig_annotations.csv --extended
- Note: all_contig.fasta can be exchanged for filtered_contig.fasta, and all_contig_annotations.csv can be exchanged for filtered_contig_annotations.csv.
Adjusting some of the parameters
- To process mouse data and/or TCR data, alter the --organism and --loci arguments to AssignGenes.py accordingly (e.g., --organism mouse, --loci tcr) and use the appropriate V, D and J IMGT reference databases (e.g., IMGT_Mouse_TR*.fasta).
2、 Identifying clones from B cells in AIRR formatted 10X V(D)J data
Splitting into separate light and heavy chain files (this is similar to the TCR processing we covered before, but done differently).
To group B cells into clones from AIRR Rearrangement data, the output of MakeDb.py must be parsed into a light chain file and a heavy chain file:
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IGH" \
--logic all --regex --outname heavy
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IG[LK]" \
--logic all --regex --outname light
Next comes data filtering, and this filtering differs greatly from what we do with ordinary transcriptome data.
The ParseDb.py tool (yet another packaged module) provides a basic set of operations for manipulating Change-O database files from the command line, including removing or updating rows and columns.
Removing non-productive sequences
After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool (picking out the useful data):
ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive
The first line above uses the select subcommand to output a single file labeled parse-select that contains only records with a value of T (-u T) in the productive column (-f productive). Alternatively, the second line uses the split subcommand to output multiple files, one for each value found in the productive column (-f productive). This will generate two files, labeled productive-T and productive-F.
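Conceptually, what select and split do can be sketched in a few lines of Python (a toy illustration of the idea, not ParseDb.py itself; the column name productive follows the AIRR schema, the sequence IDs are made up):

```python
import csv
import io

# A tiny AIRR-style TSV with a "productive" column (values invented)
tsv = "sequence_id\tproductive\nseq1\tT\nseq2\tF\nseq3\tT\n"
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# "select -f productive -u T": keep only rows whose productive column is T
selected = [r for r in rows if r["productive"] == "T"]
print([r["sequence_id"] for r in selected])  # ['seq1', 'seq3']

# "split -f productive": one output group per distinct value of the column
groups = {}
for r in rows:
    groups.setdefault(r["productive"], []).append(r)
print(sorted(groups))  # ['F', 'T'] -> files productive-F and productive-T
```

select keeps a single filtered table; split partitions the same table into one file per value, which is handy when you also want to inspect the non-productive records.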
Removing disagreements between the C-region primers and the reference alignment
If you have data that includes both heavy and light chains in the same library, the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not always agree with the isotype assignments from the C-region primers. In these cases, you can filter out such reads with the select subcommand of ParseDb.py. An example function call using an imaginary file db.tsv is provided below:
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
--logic all --regex --outname heavy
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
--logic all --regex --outname light
These commands require that all of the v_call, j_call and c_call fields (-f v_call j_call c_call with --logic all) contain one of the strings IGH (lines 1-2) or IGK or IGL (lines 3-4). The --regex argument allows partial matching and interpretation of regular expressions. The output of these two commands is two files: one containing only heavy chains (heavy_parse-select.tsv) and one containing only light chains (light_parse-select.tsv).
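The behavior of --regex combined with --logic all can be mimicked directly with Python's re module (a sketch of the matching logic only; the example gene calls are invented for illustration):

```python
import re

# One light-chain record: with --logic all, v_call, j_call AND c_call must
# all match, and --regex means partial regex matches count, so the pattern
# "IG[LK]" matches anywhere inside a call like "IGKV1-5*01"
record = {"v_call": "IGKV1-5*01", "j_call": "IGKJ2*01", "c_call": "IGKC"}

def matches_all(rec, fields, pattern):
    """Keep the record only if every listed field matches the pattern."""
    return all(re.search(pattern, rec[f]) for f in fields)

fields = ["v_call", "j_call", "c_call"]
print(matches_all(record, fields, "IG[LK]"))  # True: a consistent light chain
print(matches_all(record, fields, "IGH"))     # False: not a heavy chain
```

A record whose c_call disagreed with its v_call/j_call (say, an IGH constant region on a kappa V segment) would fail both patterns and be dropped from both output files, which is exactly the point of this filter.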
Exporting records to FASTA files
You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.
Standard FASTA
ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
--sf sequence_alignment --mf v_call duplicate_count
Where the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_alignment, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count. (Converting to FASTA is usually all we need.)
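The header construction that --if/--sf/--mf imply can be sketched as follows (a toy version, not ConvertDb.py itself; the field names follow the AIRR schema, the values and the "|field=value" header layout are assumptions for illustration):

```python
# A toy AIRR-style record (values invented for illustration)
record = {"sequence_id": "seq1", "sequence_alignment": "CAGGTGCAG",
          "v_call": "IGHV1-2*02", "duplicate_count": "7"}

def to_fasta(rec, id_field, seq_field, meta_fields):
    """Build one FASTA entry: identifier first, then the requested
    annotations appended to the header, then the sequence itself."""
    header = "|".join([rec[id_field]] + [f"{f}={rec[f]}" for f in meta_fields])
    return f">{header}\n{rec[seq_field]}"

fasta = to_fasta(record, "sequence_id", "sequence_alignment",
                 ["v_call", "duplicate_count"])
print(fasta)
```

The key idea is simply that --if names the ID column, --sf names the sequence column, and each --mf column becomes an annotation carried along in the header so downstream tools don't lose it.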
Assign clonal groups to the heavy chain data (the most critical clustering step)
The heavy chain file must then be clonally clustered separately. (The heavy chains need their own clustering; so how exactly is that done?)
Clustering sequences into clonal groups
Step one: Determining a clustering threshold
Before running DefineClones.py (a packaged module; for now just know that it does the clustering), it is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones (so the clustering here is hierarchical). The distToNearest function in the SHazaM R package calculates the distance between each sequence in the data and its nearest neighbor (much like the TCRdist-based sequence distances between TCRs we covered earlier). The resulting distribution should be bimodal: the first mode represents sequences that have clonal relatives in the dataset, and the second mode represents singletons (sequences with no close relative). The ideal threshold for separating clonal groups is the value that separates the two modes of this distribution, and it can be found using the findThreshold function in the SHazaM R package (so threshold selection can be manual or automatic). The distToNearest function allows selection of all parameters that are available in DefineClones.py; using the length normalization parameter ensures that mutations are weighted equally regardless of junction sequence length. The distance-to-nearest distribution for the example data is shown below; the threshold is approximately 0.16, indicated by the red dotted line. (So manual inspection still seems best; we'll share another article on how to choose this threshold.)
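The idea behind distToNearest can be sketched without R (a toy Python version using length-normalized Hamming distance on equal-length junctions; the sequences are invented, and the real function first partitions the data by V gene, J gene and junction length):

```python
def norm_hamming(a, b):
    """Hamming distance divided by length (the length-normalized distance)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Three toy junction sequences: the first two are near-identical "clonal
# relatives"; the third is a singleton with no close neighbor
junctions = ["TGTGCGAGAGAT", "TGTGCGAGAGAA", "TGTACTTCCGAT"]

# Distance from each sequence to its nearest neighbor; over a real
# repertoire, the histogram of these values is bimodal, and the clonal
# threshold sits in the valley between the two modes
nearest = [
    min(norm_hamming(junctions[i], junctions[j])
        for j in range(len(junctions)) if j != i)
    for i in range(len(junctions))
]
print([round(d, 3) for d in nearest])  # [0.083, 0.083, 0.417]
```

Even in this toy case, a threshold of 0.16 falls cleanly between the "has a clonal relative" mode (~0.08) and the singleton mode (~0.42), which is exactly what findThreshold automates on the full distribution.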
Step two: Clonal clustering (mainly for B cells)
There are several parameter choices when grouping Ig sequences into B cell clones. The argument --act set accounts for ambiguous V gene and J gene calls when grouping similar sequences. The distance metric --model ham is nucleotide Hamming distance. Because the threshold was generated using length-normalized distances, the --norm len argument is selected, with the previously determined threshold --dist 0.16:
DefineClones.py -d HD13M_db-pass.tsv --act set --model ham \
--norm len --dist 0.16
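With the threshold in hand, the grouping step is essentially single-linkage clustering of junctions under the distance cutoff (a toy sketch of the idea only; DefineClones.py additionally partitions by V/J gene calls and junction length, which --act set handles for ambiguous calls):

```python
def norm_hamming(a, b):
    """Length-normalized nucleotide Hamming distance (--model ham, --norm len)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def clone_clusters(seqs, threshold):
    """Single-linkage clustering: two sequences share a clone if they are
    connected by a chain of pairs each closer than the threshold."""
    clusters = []
    for s in seqs:
        # Find every existing cluster this sequence is close to...
        hits = [c for c in clusters
                if any(norm_hamming(s, m) <= threshold for m in c)]
        # ...and merge them all together with the new sequence
        merged = [s] + [m for c in hits for m in c]
        clusters = [c for c in clusters if c not in hits] + [merged]
    return clusters

junctions = ["TGTGCGAGAGAT", "TGTGCGAGAGAA", "TGTACTTCCGAT"]
clusters = clone_clusters(junctions, threshold=0.16)
print(len(clusters))  # 2 clones: the two near-identical junctions merge
```

Single linkage is what "trimming the hierarchical clustering" at a fixed height amounts to: any pair under the cutoff pulls their clusters together.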
Correct clonal groups based on light chain data
DefineClones.py currently does not support light chain cloning. However, cloning can be performed after heavy chain cloning using light_cluster.py, provided in the scripts directory of the Immcantation Bitbucket repository:
light_cluster.py -d heavy_select-pass_clone-pass.tsv -e light_select-pass.tsv \
-o 10X_clone-pass.tsv
Here, heavy_select-pass_clone-pass.tsv refers to the cloned heavy chain AIRR Rearrangement file, light_select-pass.tsv refers to the light chain file, and 10X_clone-pass.tsv is the resulting output file.
Advantages of this algorithm
- remove cells associated with more than one heavy chain
- correct heavy chain clone definitions based on an analysis of the light chain partners associated with the heavy chain clone (the two chains inform each other).
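The two corrections can be sketched over toy cell barcodes (a hypothetical illustration of the logic only, not light_cluster.py itself; every barcode, chain ID and V/J label below is invented):

```python
# cell -> heavy chains detected, and cell -> light-chain V/J partner
heavy = {"cellA": ["IGH1"], "cellB": ["IGH1", "IGH2"], "cellC": ["IGH1"]}
light = {"cellA": "IGKV1/IGKJ2", "cellC": "IGLV2/IGLJ3"}

# Step 1: remove cells associated with more than one heavy chain
kept = {c: h[0] for c, h in heavy.items() if len(h) == 1}
print(sorted(kept))  # ['cellA', 'cellC'] -- cellB is dropped

# Step 2: a heavy-chain clone whose member cells carry different
# light-chain partners gets split into separate clones
clone_members = [c for c, h in kept.items() if h == "IGH1"]
split = {}
for c in clone_members:
    split.setdefault(light.get(c), []).append(c)
print(len(split))  # the single IGH1 clone splits into 2 by light chain
```

This is why it's a correction rather than independent light-chain cloning: the heavy chain defines the clone, and the light chain is used as evidence to prune doublets and split over-merged clones.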
Next comes reconstructing germline sequences.
Adding germline sequences to the database
The CreateGermlines.py tool is used to reconstruct the germline V(D)J sequence, from which the Ig lineage and mutations can be inferred. In addition to the alignment information parsed by MakeDb.py to generate the initial database, CreateGermlines.py also requires the set of germline sequences that were used for the alignment, passed to the -r argument. In the case of V-segment germlines, the reference sequences must be IMGT-gapped. Because the D-segment call for B cell receptor alignments is often low confidence, the default germline format (-g dmask) places Ns in the N/P and D-segments of the junction region rather than using the D-segment assigned during reference alignment; this can be modified to generate a complete germline (-g full) or a V-segment-only germline (-g vonly) if you wish. The command below adds the germline sequence to the germline_alignment_d_mask column of the output database: (this part should be optional, but lineage tracing really is important)
CreateGermlines.py -d HD13M_db-pass.tsv -g dmask \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
Alternatively, if you have run the clonal assignment task prior to invoking CreateGermlines.py, then adding the --cloned argument is recommended, as this will generate a single germline of consensus length for each clone. (Of course, this involves many other tools; we'll take it slowly, since haste makes waste.)
CreateGermlines.py -d HD13M_db-pass_clone-pass.tsv -g dmask --cloned \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
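What -g dmask does to the junction can be sketched like this (a toy example: the germline string and the mask coordinates are invented; CreateGermlines.py derives the real positions from the alignment):

```python
def d_mask(germline, np_d_start, np_d_end):
    """Replace the N/P and D-segment portion of the junction with Ns,
    since the D-segment call is often low confidence."""
    return (germline[:np_d_start]
            + "N" * (np_d_end - np_d_start)
            + germline[np_d_end:])

# Toy germline junction; suppose positions 10..18 are the N/P + D region
germ = "TGTGCGAGAGGGACTACTGGTTCGACCC"
masked = d_mask(germ, 10, 18)
print(masked)  # V and J context kept, the uncertain middle becomes Ns
```

-g full would skip the masking entirely and keep the assigned D segment, while -g vonly would keep only the V portion; dmask is the cautious default because mutation analysis downstream should not count differences against an uncertain D call.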
IgPhyML lineage tree analysis (this is also the software's ultimate goal)
IgPhyML is a program designed to build phylogenetic trees and test evolutionary hypotheses regarding B cell affinity maturation.
The biology of B cell somatic hypermutation (SHM) violates important assumptions in most standard phylogenetic substitution models; further, while most phylogenetics programs are designed to analyze single lineages, B cell repertoires typically contain thousands of lineages. IgPhyML addresses both of these issues by implementing substitution models that correct for the context-sensitive nature of SHM, and by combining information from multiple lineages to give more precisely estimated repertoire-wide model parameters.
An in-depth description of IgPhyML installation and usage can be found at the IgPhyML website. (We'll need to dig into and share this software too; there really is so much.)
Starting the analysis
Once installed, IgPhyML can be run through BuildTrees by specifying the --igphyml option. IgPhyML is easiest to run through the Immcantation Docker image. If this is not possible, these instructions require Change-O 0.4.6 or higher, Alakazam 0.3.0 or higher, and IgPhyML to be installed, with the executable in your PATH variable.
The following commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what's going on, or to make sure what you're doing makes sense, please check out the IgPhyML website. (We'll cover this software in a later post.)
Build trees and estimate model parameters
# Clone IgPhyML repository to get example files
git clone https://bitbucket.org/kleinstein/igphyml
# Move to examples directory
cd igphyml/examples
# Run BuildTrees
BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse \
--sample 3000 --igphyml --clean all --nproc 1
This command processes an AIRR-format dataset of BCR sequences that has been clonally clustered, with germlines reconstructed. Trees are first built quickly using the GY94 model, and HLP19 model parameters are then estimated with these fixed topologies. This can be sped up by increasing the --nproc option. Subsampling with the --sample option is not strictly necessary, but IgPhyML runs slowly when applied to large datasets. Here, the --collapse flag is used to collapse identical sequences. This is strongly recommended, because identical sequences slow down the computation without affecting likelihood values in IgPhyML.
Visualization
The output file of the above command can be read using Alakazam's readIgphyml function (yet another packaged module). After opening an R session in the examples subfolder, enter the following commands. Note that when using the Docker container, you will need to run dev.off() after plotting a tree to create the PDF plot in the examples directory:
library(alakazam)
library(igraph)
db = readIgphyml("ex_igphyml-pass.tab")
# Plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)
# Show HLP19 parameters
print(t(db$param[1,]))
CLONE "REPERTOIRE"
NSEQ "4"
NSITE "107"
TREE_LENGTH "0.286"
LHOOD "-290.7928"
KAPPA_MLE "2.266"
OMEGA_FWR_MLE "0.5284"
OMEGA_CDR_MLE "2.3324"
WRC_2_MLE "4.8019"
GYW_0_MLE "3.4464"
WA_1_MLE "5.972"
TW_0_MLE "0.8131"
SYC_2_MLE "-0.99"
GRS_0_MLE "0.2583"
To visualize a larger dataset with bigger trees, and bifurcating tree topologies, again open an R session in the examples directory:
library(alakazam)
library(ape)
db = readIgphyml("sample1_igphyml-pass.tab",format="phylo")
# Plot largest lineage tree
plot(ladderize(db$trees[[1]]),cex=0.7,no.margin=TRUE)
There really is a lot here, and it's hard. Life is good, and better with you.