准备文件:
test.cds.fa
、test.pep.fa
$ grep ">" test.pep.fa
>AT3G47090
>AT3G47110
>AT3G47570
>AT3G47580
>AT5G20480
>AT5G39390
1. 分步计算
1.1 序列全局比对
注意:这里我将同源基因两两配对,在实际分析中选择的是共线性块的基因对。
$ # 这里用 mafft
$ mafft --auto --thread 10 test.AT3G47110-AT3G47090.pep.fa > test.AT3G47110-AT3G47090.pep.fa.aln
1.2 pal2nal.pl 回译
-
-output
: 输出格式(clustal
|paml
|fasta
|codon
),默认clustal
-
-nogap
: 删除 gap
$ pal2nal.pl test.AT3G47110-AT3G47090.pep.fa.aln test.AT3G47110-AT3G47090.cds.fa -output clustal -nogap > test.AT3G47110-AT3G47090.fa.aln.clustal
1.3 KaKs_Calculator2
- AXTConvertor
Usage: AXTConvertor [Clustal/Msf/Nexus/Phylip/Pir] [AXT]
AXTConvertor test.AT3G47110-AT3G47090.fa.aln.clustal test.AT3G47110-AT3G47090.fa.aln.axt
- KaKs_Calculator
-i
: axt 文件,将各基因对axt
文件cat到一起,生成test.pep.fa.axt
-o
: 结果文件
-c
: 密码子表
-m
: 计算方法(NG
|YN
|......),默认MA
,
$ KaKs_Calculator -i test.pep.fa.axt -o test.pep.fa.kaks
Method(s): MA
Genetic code: 1-Standard Code
Please wait while reading sequences and calculating...
1 AT3G47110&AT3G47090 [OK]
2 AT3G47570&AT3G47090 [OK]
3 AT3G47570&AT3G47110 [OK]
4 AT3G47580&AT3G47090 [OK]
5 AT3G47580&AT3G47110 [OK]
6 AT3G47580&AT3G47570 [OK]
7 AT5G20480&AT3G47090 [OK]
8 AT5G20480&AT3G47110 [OK]
9 AT5G20480&AT3G47570 [OK]
10 AT5G20480&AT3G47580 [OK]
11 AT5G39390&AT3G47090 [OK]
12 AT5G39390&AT3G47110 [OK]
13 AT5G39390&AT3G47570 [OK]
14 AT5G39390&AT3G47580 [OK]
15 AT5G39390&AT5G20480 [OK]
Mission accomplished. (Time elapsed: 0:25)
1.4 yn00
还没看,之后补充
yn00
2. 一步计算
-
ParaAT.pl
-h
: 同源基因对文件
-n
: cds.fa
-a
: pep.fa
-p
: 指定多线程文件
-m
: 指定比对工具(clustalw2
|t_coffee
|mafft
|muscle
)
-g
: 去除比对有gap的密码子
-k
: 用KaKs_Calculator 计算kaks值
-o
: 输出结果的目录
-f
: 输出比对文件的格式(axt
|fasta
|paml
|codon
|clustal
)
$ grep ">" test.pep.fa | sed 's/>//' | awk '{a[NR]=$1;for(i=1;i<NR;i++){print $1"\t"a[i]}}' > geneid.pairs.txt
AT3G47110 AT3G47090
AT3G47570 AT3G47090
AT3G47570 AT3G47110
AT3G47580 AT3G47090
AT3G47580 AT3G47110
AT3G47580 AT3G47570
AT5G20480 AT3G47090
AT5G20480 AT3G47110
AT5G20480 AT3G47570
AT5G20480 AT3G47580
AT5G39390 AT3G47090
AT5G39390 AT3G47110
AT5G39390 AT3G47570
AT5G39390 AT3G47580
AT5G39390 AT5G20480
$ echo 12 > proc # 设置线程数
$ ParaAT.pl -h geneid.pairs.txt -n test.cds.fa -a test.pep.fa -o paraat -m mafft -f axt -g -p proc -kaks
$ ls paraat | head -4
AT3G47110-AT3G47090.cds_aln.axt
AT3G47110-AT3G47090.cds_aln.axt.kaks
AT3G47570-AT3G47090.cds_aln.axt
AT3G47570-AT3G47090.cds_aln.axt.kaks
参考:
简书 | Kaks_calculator计算ka/ks 值
简书 | kaks calculator批量计算多个基因的选择压力kaks值
简书 | Mr_我爱读文献 | 学会正确选择多序列比对(coding-sequences)软件
于振鹏 | PAML选择压力分析
陈连福 | 使用PAML软件利用密码子序列进行正选择分析
公众号 | 基因学苑 | 生物信息百Jia软件(三):Muscle
补充:
多序列比对的软件:muscle、mafft、clustalw......
现在常用的是前两者,上面用的mafft,这里看一下muscle
什么时候要用到多序列比对,它的结果能用于什么呢?
1. 用于构建基因树:
1.1 用trimAl 修剪比对结果,用iqtree
、fasttree
等进行建pep
树;
1.2 用pal2nal.pl
将cds
序列回帖到比对结果,用于构建cds
树。2. 用于构建物种树:将单拷贝基因家族的比对结果串联建树。
3. 共线性块上的基因对进行全局比对,回帖
cds
序列,用yn00
等计算ka、ks值。
如果不正确,希望批评指正!
$ conda search muscle
Loading channels: done
# Name Version Build Channel
muscle 3.8.31 0 bioconda
muscle 3.8.1551 1 bioconda
muscle 3.8.1551 2 bioconda
muscle 3.8.1551 h2d50403_3 bioconda
muscle 3.8.1551 h6bb024c_4 bioconda
muscle 3.8.1551 h7d875b9_6 bioconda
muscle 3.8.1551 hc9558a2_5 bioconda
muscle 5.1 h7d875b9_0 bioconda
muscle 5.1 h9f5acd7_1 bioconda
- 运行
$ cat test.fa
>gene1
MRLFLLLAFNALMQLEAYGFTDESDRQALLEIKSQVSESKRDALSAWNNSFP
>gene2
MGVPCIVMRLILVSALLVSVSLEHSDMVCAQTIRLTEETDKQALLEFKETSRVVLG
>gene3
MRLFLLLAFNALMLLETHGFTDETDRQALLQFKSQVSEDKRVVLSSWNHSFPLCNWKGVT
>gene4
MKLFLLLSFSAHLLLGETDRQALLEFKSQVSEGKRDVLSSWNNSFPLCNWKWVT
>gene5
MKLSFSLVFNALTLLLQVCIFAQARFSNETDMQALLEFKSQVSENNKREVLASWNHSSPF
>gene6
MKVCILVFAQARFSNETDMQALLEFKSQVTENKREVLASWNHSFPL
$ muscle -in test.fa -quiet | seqkit seq -w 0
>gene2
MGVPCIVMRLILVSALLVSVSLEHSDMVCAQTIRLTEETDKQALLEFKE-----TSRVVLG---------------
>gene5
-------MKLSFS--LVFNALTLLLQVCIFAQARFSNETDMQALLEFKSQVSENNKREVLASWNHSSPF-------
>gene6
-------MKVCIL---------------VFAQARFSNETDMQALLEFKSQVTE-NKREVLASWNHSFPL-------
>gene4
-------MKLFLL--LSFSAHL------LL------GETDRQALLEFKSQVSE-GKRDVLSSWNNSFPLCNWKWVT
>gene1
-------MRLFLL--LAFNALM------QLEAYGFTDESDRQALLEIKSQVSE-SKRDALSAWNNSFP--------
>gene3
-------MRLFLL--LAFNALM------LLETHGFTDETDRQALLQFKSQVSE-DKRVVLSSWNHSFPLCNWKGVT
- 其他
$ muscle
MUSCLE v3.8.1551 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
Basic usage
muscle -in <inputfile> -out <outputfile>
Common options (for a complete list please see the User Guide):
-in <inputfile> Input file in FASTA format (default stdin)
-out <outputfile> Output alignment in FASTA format (default stdout)
-diags Find diagonals (faster for similar sequences)
-maxiters <n> Maximum number of iterations (integer, default 16)
-maxhours <h> Maximum time to iterate in hours (default no limit)
-html Write output in HTML format (default FASTA)
-msf Write output in GCG MSF format (default FASTA)
-clw Write output in CLUSTALW format (default FASTA)
-clwstrict As -clw, with 'CLUSTAL W (1.81)' header
-log[a] <logfile> Log to file (append if -loga, overwrite if -log)
-quiet Do not write progress messages to stderr
-version Display version information and exit
Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags
看来我的版本比较老了