【跟着manual学软件】MCScan-transposed

软件介绍

首先放官网：MCScanX-transposed: detecting transposed gene duplications based on multiple collinearity scans
文献：MCScanX-transposed: detecting transposed gene duplications based on multiple colinearity scans

他能干什么

首先放文章摘要：

Gene duplication occurs via different modes such as segmental and single-gene duplications. Transposed gene duplication, a specific form of single-gene duplication, ‘copies’ a gene from an ancestral chromosomal location to a novel location. MCScanX is a toolkit for detection and evolutionary analysis of gene colinearity. We have developed MCScanX-transposed, a software package to detect transposed gene duplications that occurred within different epochs, based on execution of MCScanX within and between related genomes. MCScanX-transposed can be also used for integrative analysis of gene duplication modes for a genome and to annotate a gene family of interest with gene duplication modes.

首先理解什么是Transposed gene duplication（我理解为基于转座事件的基因复制），说人话就是一个基因从祖先染色体位置『跳』到一个全新的位置。
MCScanX-transposed有完整的MCScanX的功能，基于MCScanX，MCScanX-transposed可以综合的分析基因复制事件和对基因家族成员复制模型进行检测。

本文想干啥

既然作者都说了MCScanX-transposed的特点，那我的主要关注点就是对于复制事件的分析了，共线性的问题，MCScanX的教程多如牛毛，就不造轮子了。

软件安装

官网：

unzip MCScanX-transposed.zip
cd MCScanX-transposed
make

推荐关注风槿如画_7847

之前再安装MCScanX时，mac遇到了一些问题，需要再编译之前修改几个文件。
详情见：MCScanX安装、报错解决方法及简单使用，当然macos用户还需要安装xcode.
MCscanX-transpose好像解决了这些问题，直接下载好源码，make
具体参考：MCScanX-transposed安装及使用

依赖1：Bio::seqIO

在后续脚本的使用中会调用Bioperl的Bio::seqIO模块，有些操作系统不知道是不是有安装的，但是mac没有，或许你现在不了解怎么回事，也可以不用管，但是当你按照下面的命里一跑就会提示报错，报错后google吧... 可以根据安装Bioperl最基本模块Bio::SeqIO安装
这里我记录下我的解决方案：

## 首先确保你装了cpan
cpan
## 安装提示选择mirror等一系列配置，基本是一路yes下来的，有一个部分会询问你是当前用户安装还是什么的，我选了sudo
Install Bio::SeqIO
## 这个过程巨长，跟网速有关，不能做到无痛，两个原因，首先我选择了sudo，中间需要输入几次密码，还有再编译的时候他会询问你是否编译，也是一路yes

依赖2：clustalw

这里装clustalw的方式五花八门，我为了省事用conda装的，记得之前直接

conda install clustalw -y

就可以了，但是报错提示在我提供的频道里找不到这个软件，于是上google搜索关键词conda clustalw发现：https://anaconda.org/biobuilds/clustalw

conda install -c biobuilds clustalw

这里坑爹的是clustalw好像更新成了clustalw2，在bash下clustalw也没有这个alias，用Tab补齐后发现变成了clustalw2，原本想着不行重新装一个clustalw算了，后来打开看了下也没有多大区别，所以搞了个骚操作

vim add_ka_ks.pl
## 用/clustalw搜索了下脚本中调用clustalw的代码，发现:
system("clustalw -infile='temp7734.pro'");
## 然后按i进入插入模式，把clustalw改成clustalw2 ESC  :wq保存退出。

当你也可以用什么alias之类的或者直接装一个clustalw

依赖3 JDK

在编译之前需要安装JDK，不然下游分析包中的java无法编译成功，下载方法直接百度就好了

软件使用

示范数据

首先可以打开软件包中的data文件夹，官方给的实例数据，这些不仅是实例数据，更是一个脑残版数据格式实例，仿佛在说：别TM老问我数据啥格式，你打开一个看看依样画葫芦就行！

2) The input data include:
a) “at.gff”, chromosomal positions of A. thaliana genes
b) “at.blast”, intraspecies blastp in A. thaliana
c) “at_al.gff”, chromosomal positions of A. thaliana and A. lyrata genes
d) “at_al.blast”, interspecies blastp between A. thaliana and A. lyrata
e) “at_br.gff”, chromosomal positions of A. thaliana and Brassica rapa genes
f) “at_br.blast”, interspecies blastp between A. thaliana and Brassica rapa
g) “at_cp.gff”, chromosomal positions of A. thaliana and Carica papaya genes
h) “at_cp.blast”, interspecies blastp between A. thalianaand Cariaca papaya
i) “at_pt.gff”, chromosomal positions of A. thaliana and Populus trichocarpa genes
j) “at_pt.blast”, interspecies blastp between A. thaliana and Populus trichocarpa
k) “at_vv.gff”, chromosomal positions of A. thaliana and Vitis vinifera genes
l) “at_vv.blast”, interspecies blastp between A. thaliana and Vitis vinifera
m) “at.cds”, A. thaliana coding sequences
n) “mads.genes”, A. thaliana MADS box genes
o) “mads.newick”, bracket tree of A. thaliana MADS box genes

准备文件

其实需要准备xx.blast, xx.gff, xx.cds, xx.genes以及xx.newick这几个文件就可以了，做blast的话，可以conda下面安装blast，然后把准备好的蛋白序列进行blast，我做的是target to outgroup,也就是target作为query，outgroup作为database。
如果不习惯代码行操作，推荐使用TBtools的blast GUI wrapper目录下的Two Sequence File工具，选择参数的时候把输出格式选择为table，把NumofHits选择为5. 软件Q&A部分推荐blast结果no more than 5的原因。
gff文件和MCScanX的一样，具体自己翻翻简书，有写好的命令行一键把原始gff转换成MCScanX的gff格式的。
cds文件就是测序玩组装好的cds文件，格式为fasta
xxx.genes xxx家族的基因ID
xxx.newick，这个不用说，就是这个家族基于蛋白序列的系统进化树

核心代码

MCScanX-transposed
This program carries out detection of transposed gene duplications within different epochs and classificaiton of gene duplication modes.
Usage:

perl MCScanX-transposed.pl -i data_directory -t target_species -c outgroup_species(comma_delimited) -o output_directory

Optional:
-x number_of_different_epoches (if specified, outgroup species must be provided in the order of divergence from the target species(most recent first), default: 1, only consider the transposed duplications that occurred after the divergence between target species and all outgroups )
-a 1 or 0(are segmental duplicates ancestral loci or not? default: 1, yes)
-d number_of_genes(maximum distance to call proximal, default: 10)"
IMPORTANT:Users must prepare the input files by carefully reading the following instructions (1-4).

All input files should be stored under ONE folder(the "data_directory" parameter)

For the target genome in which gene duplicaiton modes will be classified, please prepare two input files:
a) "[target_species].gff", a gene position file for the target species, following a tab-delimited format: "sp&chr_NO gene starting_position ending_position"
b) "[target_species].blast", a blastp output file (m8 format) for the target species (self-genome comparison).

For each outgroup genome, please prepare two input files:
a) "[target_species][outgroup_species].gff", a gene position file for the target_species and outgroup_species, following a tab-delimited format:"sp&chr_NO gene starting_position ending_position"
b) "[target_species][outgroup_species].blast", a blastp output file (m8 format) between the target and outgroup species (cross-genome comparison).

For example, assuming that you are going to classify gene duplication modes in Arabidopsis thaliana (ID: at), using Brassica rapa (ID: br) and Carica papaya (ID: cp) as outgroups, you need to prepare 6 input files: "at.gff","at.blast", "at_br.gff", "at_br.blast","at_br.gff","at_cp.gff" and "at_cp.blast".

主体包括了4个参数

-i你存放数据的目录，注意测试发现必须吧数据放在软件路径下的data路径（文件夹）下，可以在data下面建立子目录，比如我做的棉花，那你的路径

~/xxx/MCScanX-transposed/result/cotton

-t 这个和MCScanX操作一样，如果按照示例数据来，你要比较拟南芥和其他那么，你所有拟南芥的数据均是以at.xxx命名的，那么这里跟的就是at，严格按照文件名不要发挥想象写成什么At,Arabidopsis,ninanjie😂.
-c 比对物种的简称，每个物种简称直接用逗号隔开，注意是英文逗号
-o 输出文件，包括位置，前缀等等。
还有三个可选参数
-x 遗传学没学好，这里理解有些困难，epoches的中文意思翻译过来是世代，纪元（如有错误请指出，谢谢）如果指定-x参数，outgroup中的物种必须按照与目标物种世代关系远近排列，关系越近，排名越靠前。默认值为1，意思是target物种中转座重复只发生在target物种和其他物种分化之后。
-a 片段重复是否在祖先位点...... 默认是1
-d 定义为proximal的最大基因个数，默认是10。(能理解啥意思，翻译过来感觉贼别扭ಥ_ಥ)

Examples:

perl MCScanX-transposed.pl -i data -t at -c al,br,cp,pt,vv -o result/test1 -x 3
perl MCScanX-transposed.pl -i data -t at -c br,cp,pt,vv -o result/test2

Different modes of gene duplication including segmental, tandem, proximal and transposed are output as separate files (".pairs") with each line containing one gene duplication. In transposed duplications,the first duplicated gene is the transposed locus. Unique duplicates belonging to each mode are also output as a separate file (".genes"). Interim collinearity files generated by MCScanX are available under the output directory.

Google了一下at和al的关系：

Arabidopsis lyrata is the closest well-characterized relative of the model plant Arabidopsis thaliana, thought to have separated ~5 mya

这里可以看到at和al的亲缘关系是要比br,cp,pt,vv等要近的多的，所以在有al的比较中修改了-x的值为3，但是在没有al时，认为这些转座重复的产生是在物种分化之后形成的，这里就需要用到默认的-x值1。当认为转座重复事件发生在物种分化之前的时候，输出结果中会计算目标物种和outgroup中每个物种的同源基因复制关系：

The results are under the directory “result/at_result”. Segmental, tandem, proximal and transposed duplications are in the files “at.segmental.pairs”, “at.tandem.pairs”, “at.proximal.pairs” and “at.transposed.pairs” respectively. Segmental, tandem, proximal and transposed duplicates are in the files “at.segmental.genes”, “at.tandem.genes”, “at.proximal.genes” and “at.transposed.genes” respectively.
The transposed duplications that occurred after A. thaliana-A. lyrata divergence, between A. thaliana-A. lyrata and Arabidopsis-Brassica divergence and between Arabidopsis-Brassica and Arabidopsis-Carica divergence are in the files "at.transposed_after_al.pairs", "at.transposed_between_al_br.pairs" and "at.transposed_between_br_cp.pairs" respectively.
Note that interim collinearity files are also available, including “at. Collinearity”, “at_al.collinearity”, “at_br.collinearity”, “at_cp.collinearity”, “at_pt.collinearity” and “at_vv.collinearity”.

计算KaKs

不要问我计算的结果和kakscalculator等其他软件哪个更好，作为一个调包侠这问题超纲了，如果你懂计算原理，自己用sublime，editplus等软件打开add_ka_ks.pl这个脚本看，里面有计算过程。
看脚本这里需要用到Bio::SeqIO和clustalw,如果你之前没有装的话这里会报错，所以如果你跑这个脚本报错的话，自行百度Google，stackoverflow，biostar上找方案吧。我的解决方案不一定适合你，特别是跨平台。

比如要计算at和al分化之后拟南芥中重复基因的kaks，需要执行：

“perl add_ka_ks.pl -d data/at.cds -i result/at_result/at.transposed_after_al.pairs -o result/at.transposed_after_al.pairs.kaks”

我做的陆地棉的结果如下：

Transposed	Location	Parental	Location	E-value	Ka	Ks
Gh_A01G000200	A01:60647	Gh_Contig00288G000400	Contig00288:58357	0.0	0.0861685543371287	0.102513449889525
Gh_A01G015700	A01:1329366	Gh_D11G230300	D11:32648963	7.48e-176	0.107481363628075	0.100119438035867
Gh_A01G016300	A01:1392126	Gh_D11G091700	D11:7573295	0.0	0.150062951494286	0.93426202813272

基因家族重复事件

这个很简单，需要你把家族成员的ID写到一个文件中，然后运行：

perl detect_dup_modes_for_a_family.pl -i data/mads.genes -d result/at_result/at -o result/mads.duplication.modes

结果就是你指定的输出文件，参数和上面的核心参数也一致。最后会列出一个表，表中会注明复制事件的模式，例如是串联重复（tandem），片段重复（segmental），近端重复(proximal)还是转座重复（transposed）。

在进化树上显示基因家族复制事件

很奇怪，用示例数据可以直接跑出结果，如下：

mads.tree.modes.png

但是我尝试了下自己的数据发现会报错：

Exception in thread "main" java.lang.NullPointerException
    at annotate_tree_with_dup_modes.recursion(annotate_tree_with_dup_modes.java:144)
    at annotate_tree_with_dup_modes.processtree(annotate_tree_with_dup_modes.java:207)
    at annotate_tree_with_dup_modes.main(annotate_tree_with_dup_modes.java:379)

经过各种尝试还是失败.... 没有接触过java，真的不太了解这个错误的原因，但是上面这个图实现的原理也很简单，完全可以用其他软件或者其他方式实现，所以不是刚需，也没有再纠结。喜欢折腾的小伙伴可以折腾下！

不同世代转座事件的可视化

同样这个也没有尝试，因为我的结果中不存在transposed-duplication，全部都是segmental或者tandem，放上示例数据结果：

after_al.png

总结

MCScanX-Transposed 个人感觉是一款非常容易上手，门槛不高的软件（调包侠的角度），至于结果，毕竟不懂原理，不敢妄加评论。

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 206,214评论 6赞 481
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 88,307评论 2赞 382
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 152,543评论 0赞 341
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 55,221评论 1赞 279
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 64,224评论 5赞 371
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 49,007评论 1赞 284
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,313评论 3赞 399
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,956评论 0赞 259
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 43,441评论 1赞 300
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,925评论 2赞 323
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 38,018评论 1赞 333
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,685评论 4赞 322
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,234评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 30,240评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,464评论 1赞 261
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,467评论 2赞 352
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,762评论 2赞 345