ABBA-BABA分析方法最早发布于2012年NATURE的文章:Butterfly genome reveals promiscuous exchange of mimicry adaptations among species(https://www.nature.com/articles/nature11041)
原理:
简单来说,假设有三个物种,P1,P2,P3,利用它们的全基因组SNPs建树得到的是下图中的树,即P1和P2为单系,再与P3聚到一起。那么根据分子钟原理,P1与P2的共同祖先从P3中分离出,而后经过相同时间演化出P1和P2,那么以下几点应该成立:
(1)【P1到P3】和【P2到P3】的遗传距离应该是大致相同的(考虑到不同位点的进化速率不同,至少在全基因组范围,其平均遗传距离应该是大抵相同的)。
(2)假如P2与P3之间发生过杂交事件,那么P2-P3的距离应该小于P1-P3的举例;反之,如果P1与P3发生过杂交事件,则P1-P3<P2-P3。
(3)ABBA-BABA顾名思义,将遗传距离相近的两物种用同一字母表示(A或B),ABBA即为P2-P3之间距离较近(但此时不能下定论,要通过统计量确定其两者是否存在遗传渐渗),BABA即为P1与P3之间距离较近。
(4)D统计(D-statistics)以及Z-score的计算:假设使用全基因组SNPs数据进行研究,无论是使用滑窗法(将连续的数个SNPs一起分析)还是单个SNPs进行研究,最终的D统计都是:D=(ABBA树数量-BABA树数量)/(ABBA树数量+BABA树数量)。看到这里也就懂了,如果D统计量=0,说明总体ABBA和BABA的数量相同,不存在明显的渗入;如果不等于0,则或许存在渗入。进一步计算Z-score,在本软件中用的是:
𝑍-score = 𝐷/𝑠𝑡𝑑_𝑒𝑟𝑟(𝐷)
Z>3和<-3分别是正负显著。
P.S. 其实Patterson‘s D统计就是ABBA-BABA统计。
实际软件操作:
Dsuite软件链接:https://github.com/millanek/Dsuite
Publication:Malinsky, M., Matschiner, M. and Svardal, H. (2020), Dsuite ‐ fast D‐statistics and related admixture evidence from VCF files. Mol Ecol Resour. Accepted Author Manuscript. https://doi.org/10.1111/1755-0998.13265
Dsuite的优势:
(1)可以使用SNPs和Indels
(2)个体及种群无上限(可到几百)
(3)可以设置genomic sliding windows
(4)可以进行全基因组范围的D统计量和f4-ratio计算(f4是一种统计量,用来验证一棵无根树是否符合预想的拓扑,具体解释见后文,类似的还有f3,f2等等)
1、安装
$ git clone https://github.com/millanek/Dsuite.git
$ cd Dsuite
$ make
2、命令
Commands:
Dtrios Calculate D (ABBA-BABA) and f4-ratio statistics for all possible trios of populations/species
DtriosCombine Combine results from Dtrios runs across genomic regions (e.g. per-chromosome)
Dinvestigate Follow up analyses for trios with significantly elevated D:
calculates f_d, f_dM, and d_f in windows along the genome
Fbranch Calculate D and f statistics for branches on a tree that relates the populations/species
Experimental:
Dquartets Calculate D (ABBA-BABA) and f4-ratio statistics for all possible quartets of populations/species
(no outgroup specified)
Usage:
a) Dsuite Dtrios [OPTIONS] INPUT_FILE.vcf SETS.txt
b) Dsuite Dquartets [OPTIONS] INPUT_FILE.vcf SETS.txt
c) Dsuite Dinvestigate [OPTIONS] INPUT_FILE.vcf.gz SETS.txt test_trios.txt
d) Dsuite Fbranch [OPTIONS] TREE_FILE.nwk FVALS_tree.txt
一般应该是用Dtrios
3、输入文件
(1)vcf文件,可以包含SNPs和Indels的多等位基因位点,但必须是biallelic的。.vcf
(2)个体及种群名录(Population/species map )SETS.txt
如下图所示 注意要用tab分割
Ind1 Species1
Ind2 Species1
Ind3 Species2
Ind4 Species2
Ind5 Species3
Ind6 Outgroup
Ind7 Outgroup
Ind8 xxx
... ...
IndN Species_n
想用多少用多少个体,如果中途希望取子集研究,只需要把不需要的个体用xxx替代就行。
在Dtrios
命令中至少要指定一个outgroup。
在Dquartets
中不应该有outgroup,因为这个命令是探索所有可能patterns的。
要用另外三个算法的话要输入额外的文件,这里不再列出,可以参考github(见上文链接)
4、运行Dtrios
Usage: Dsuite Dtrios [OPTIONS] INPUT_FILE.vcf SETS.txt
Calculate the D (ABBA/BABA) and f4-ratio (f_G) statistics for all trios of species in the dataset (the outgroup being fixed)
The results are as definded in Patterson et al. 2012 (equivalent to Durand et al. 2011 when the Outgroup is fixed for the ancestral allele)
The SETS.txt should have two columns: SAMPLE_ID SPECIES_ID
The outgroup (can be multiple samples) should be specified by using the keywork Outgroup in place of the SPECIES_ID
-h, --help display this help and exit
-k, --JKnum (default=20) the number of Jackknife blocks to divide the dataset into; should be at least 20 for the whole dataset
-j, --JKwindow (default=NA) Jackknife block size in number of informative SNPs (as used in v0.2)
when specified, this is used in place of the --JKnum option
-r, --region=start,length (optional) only process a subset of the VCF file
-t, --tree=TREE_FILE.nwk (optional) a file with a tree in the newick format specifying the relationships between populations/species
D and f4-ratio values for trios arranged according to the tree will be output in a file with _tree.txt suffix
-o, --out-prefix=OUT_FILE_PREFIX (optional) the prefix for the files where the results should be written
output will be put in OUT_FILE_PREFIX_BBAA.txt, OUT_FILE_PREFIX_Dmin.txt, OUT_FILE_PREFIX_tree.txt etc.
by default, the prefix is taken from the name of the SETS.txt file
-n, --run-name (optional) run-name will be included in the output file name after the PREFIX
--no-f4-ratio (optional) don't calculate the f4-ratio
-l NUMLINES (optional) the number of lines in the VCF input - required if reading the VCF via a unix pipe
5、总结
总而言之,最后的代码大概就这一行,参数可以都为默认
Dsuite Dtrios [OPTIONS] INPUT_FILE.vcf SETS.txt
最后得到的BBAA.txt, Dmin.txt, tree.txt就是最终结果,之后再绘图。
如果是将基因组分成若干文件来跑的,需要注意设置【-n, --run-name】参数,最后可以用DtriosCombine将combine.txt和combine_stderr.txt合并。
f4统计量简介
直接看下文公式:假设有A、B、C、D四个个体,代表四个类群,aj/bj/cj/dj分别是该类群在位点j的参考基因组基因频率(reference allele frequency)。总位点J个。分母是用来将统计量正态化的,类群x可以左右选择,对好的是选择与感兴趣类群关系较近的population。
同样的,f4统计量最后也要用计算Zscroe,以正负3为显著性的阈值。