fastsimcoal2推断群体演化历史(Demographic history)

1. 群体演化历史分析的案例：

群体演化包括有效群体大小(Ne)变化、基因流，迁徙，分化等，会对等位基因频率产生显著影响，塑造了现有遗传多样性的模式和水平。群体演化、遗传漂变和自然选择共同决定了基因组遗传多样性的命运。
Example: 苏格兰罕见的食肉褐鳟（rare piscivorous brown trout: ferox）与普通褐鳟（normal brown trout）在进化过程中是否有基因交流？

2. 如何通过测序数据进行群体演化历史的推断？

（1）群体的系谱进化树的形状包含了群体演化历史的模式

（2）在不同的群体演化历史的模式下，SFS有不同的分布

什么是SFS(site frequency spectrum)：
请参考位点频谱(SFS)计算
单个群体

两个群体

颜色越偏红，表示数量越多，越偏蓝表示数量越少。如果两个群体完全分开，那它们derived allele频率相同的交集就越少，表现在2D SFS上就是密度偏向各自的坐标轴，如果群体交流混合，它们derived allele频率相同的交集就越多，表现在2D SFS上就是密度偏向x=y的这条对角线。后面两个模型对SFS的影响很像，都是使两群体的SFS趋同，可能结合群体分化时间，核苷酸突变速率等推断具体是哪种模型。

（3）通过最大似然估计得到最佳参数

个人理解是，通过推断参数，给出模型，使expected SFS更合理，也就是更加类似于两个分离的群体。模型参数包括：群推分歧时间（也可以自己提供），基因流，以及有效群体大小。

（4）计算软件：fastsimcoal2

fastsimcoal2是由伯尔尼大学的Laurent Excoffier小组2016年开发的一种非常灵活的人口统计(Demography)建模软件。它通过执行合并模拟，使用位点频谱（SFS），推断最适合所观察数据的模型参数。

models.png

软件下载
官网：http://cmpg.unibe.ch/software/fastsimcoal2/
使用fastsimcoal2和模拟数据在简单模型下推断参数
输入文件

3. fastsimcoal2实战分析

使用fastsimcoal2和软件自带的测试数据在简单模型下推断参数

（1）准备输入文件：

输入文件1：SFS文件
文件2PopDiv20Mb_jointDAFpop1_0.obs是两个群体的观测2D-SFS.

# set the working directory
setwd("/Software/fsc26_mac64/example files/2PopDiv20Mb")
# read the observed SFS
# 2D from two populations with different effective sizes that diverged some time ago
pop2d <- read.table("2PopDiv20Mb_jointDAFpop1_0.obs", skip=1, stringsAsFactors = F, header=T, row.names = 1)
head(pop2d)
         d0_0 d0_1 d0_2 d0_3 d0_4 d0_5
d1_0 19985747 8350 1628  360   62    8
d1_1      966    0    0    0    0    0
d1_2      479    0    0    0    0    0
d1_3      328    0    0    0    0    0
d1_4      249    0    0    0    0    0
d1_5     1760   13   18   13   19    0

输入文件2：定义模型文件
每个模型都在TPL文件中定义，我们要推断的参数都有名称标签（例如NPOP，TDIV）

//Parameters for the coalescence simulation program : fsimcoal2.exe
2 samples to simulate :
//Population effective sizes (number of genes)
NPOP1
NPOP2
//Samples sizes and samples age 
5
5
//Growth rates  : negative growth implies population expansion
0
0
//Number of migration matrices : 0 implies no migration between demes
0
//historical event: time, source, sink, migrants, new deme size, new growth rate, migration matrix index
1  historical event
TDIV 1 0 1 RESIZE0 0 0
//Number of independent loci [chromosome] 
1 0
//Per chromosome: Number of contiguous linkage Block: a block is a set of contiguous loci
1
//per Block:data type, number of loci, per generation recombination and mutation rates and optional parameters
FREQ  1   0   2.5e-8 OUTEXP

EST文件中定义了每个参数的搜索范围。对于每个参数，我们使用相应的参数名称标签指定搜索范围。

// Search ranges and rules file
// ****************************

[PARAMETERS]
//#isInt? #name   #dist.#min  #max 
//all Ns are in number of haploid individuals
1  NPOP1       logunif  10   1e7   output
1  NPOP2       logunif  10   1e7   output
1  NANC        logunif  10   1e7   output 
1  TDIV        unif     10   1e5   output 

[RULES]

[COMPLEX PARAMETERS]

0  RESIZE0   = NANC/NPOP1      hide

注意：TPL和EST文件需要具有相同的文件名，但具有不同的扩展名：[filename] .est和[filename] .tpl。

（2）运行fastsimcoal2：

#参数估计（推测群体演化历史）
fsc26 -t PopDiv_diff.tpl -e PopDiv_diff.est -d -0 -n 100000 -L 40 -s 0 -M -q -c 80
#还可以产生模拟群体数据（另一个功能）
#使用输入参数文件中定义的参数值来模拟进化模型下的数据
fsc26 -i test.par -n 100
#使用从先验随机抽取的参数值来模拟进化模型下的数据
fsc26 -t test.tpl -n 10 -e test.est -E 100
#使用在外部文件中定义的参数值来模拟进化模型下的数据
fsc26 -t test.tpl -n 100 -f test.def

参数说明：

-n: 模拟次数，该值应大于100,000。建议使用200,000到1,000,000。
-L: 迭代次数，该值至少应为20，建议使用50和100之间。
-M: 使用似然优化推断参数。
-d: 对于derived SFS使用-d，对于MAF SFS使用-m。
-0: 说明观察到的SFS中没有单态位。
-q: 快速模式，不打印所有信息。
-C: 为具有至少1个SNP的所有输入计算似然。如果指定-Cx，则所有少于x个SNP的条目将汇集在一起。当观察到SFS中很多位点SNP很少时，此选项用以避免过度拟合。
-c: 指定多线程的选项。 -c1 -B1用于单核，-c4 -B4用于4核。

（3）软件输出：

对我们最重要的三个文件：

* .bestlhoods：具有最大似然参数估计值和相应似然性的文件。 这就是我们想要的-参数估计！
* _DAFpop0.txt：具有通过优化过程中使似然性最大化的参数获得的预期SFS的文件，对于检查SFS是否合适是必需的。
* .simparam：文件中包含运行模拟的设置示例，用于错误时检查。

查看结果

#读取最大似然估计参数文件
maxlhoodEstimates <- read.table(paste(settings$pathTo_InputFile, "/",
                                      settings$TPL_EST_file_tag, "/",
                                      settings$TPL_EST_file_tag, ".bestlhoods",
                                      sep=""), header=T)
#查看估计参数
maxlhoodEstimates

其中，MaxObsLhood指如果期望值与观察到的SFS完全匹配，即期望的SFS是相对观察到的SFS，则值越大。MaxEstLhood是根据模型参数估计的最大似然，拟合度越高，MaxObsLhood和MaxEstLhood之间的差异就越小。

#获取观察到的和预期的SFS
# Fit of the model expected SFS to the observed SFS

# Tag for the end of the observed SFS file
obsfileend <- "_jointDAFpop1_0"

# Read the observed SFS - SNP counts
obssfs <- read.table(paste(settings$pathTo_InputFile,
                    settings$TPL_EST_file_tag, obsfileend, ".obs",
                    sep=""), skip=1, stringsAsFactors = F, header=T)

# Read the expected SFS - PROBABILITIES
expsfs <- read.table(paste(settings$pathTo_InputFile,
                    settings$TPL_EST_file_tag, "/",
                    settings$TPL_EST_file_tag, obsfileend, ".txt",
                    sep=""), header=T)

# Plot the fit of the 2D SFS, including of the marginal 1D SFS
# the function plot2dSFS is defined in the utilfunctions.r
# you need to give as input the following arguments
#    obsSFS : matrix with observed SFS (counts)
#    expSFS : matrix with expected SFS (probabilities)
#    xtag : string with the label of x-axis
#    ytag : string with the label of y-axis
#    minentry : number with the minimum entry in the SFS (all entries with less than this are pooled together)
plot2dSFS(obsSFS=obssfs, expSFS=expsfs, xtag="Pop2", ytag="Pop1", minentry=1)

#绘制模型
#maxL.par file 是一个最大似然估计参数文件
path_to_maxL_file <- paste(settings$pathTo_InputFile, settings$TPL_EST_file_tag, "/",settings$TPL_EST_file_tag, "_maxL", sep="")
parFileInterpreter(args=path_to_maxL_file, pop.names=c("Pop1","Pop2"), gentime=1, printPDF=FALSE)

参考：