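Orthologs
Orthologs (orthologous genes) are one type of homolog. Before two species diverged, the gene existed as a single copy in their common ancestor. Each new species inherited a copy, and the copies may then have changed independently. In short, orthologs are the same gene in different species.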
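Running OrthoFinder
The job is submitted through LSF. In the orthofinder.py call, -f gives the folder of proteome FASTA files, -S diamond selects DIAMOND for the sequence search, -og stops once the orthogroups have been inferred, and -t 10 runs the search on the 10 cores requested with #BSUB -n 10.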
#BSUB -J blast
#BSUB -n 10
#BSUB -R span[hosts=1]
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q normal
cd /public/home/qtxu/work/OrthologFind/proteome
../OrthoFinder-2.5.4/orthofinder.py -f data/ -S diamond -og -t 10
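Before submitting the job, it can help to confirm that the input folder really holds one FASTA file per species and that each file contains sequences. The following is a minimal sketch, not part of OrthoFinder: count_fasta_records is a hypothetical helper name, and the list of extensions reflects my understanding of what OrthoFinder accepts, so treat it as an assumption.

```shell
#!/usr/bin/env bash
# count_fasta_records DIR
# Hypothetical helper: prints "<file name><TAB><number of sequences>" for each
# FASTA file in DIR. The extension list is an assumption about OrthoFinder input.
count_fasta_records() {
    local dir=$1 f
    for f in "$dir"/*.fa "$dir"/*.faa "$dir"/*.fasta "$dir"/*.fas "$dir"/*.pep; do
        [ -f "$f" ] || continue    # skip glob patterns that matched nothing
        # Every FASTA record starts with a ">" header line, so count those.
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
    done
}
```

Calling count_fasta_records data/ from the proteome folder should print one line per proteome file. A count of 0, or a missing file, suggests the input is not what OrthoFinder expects.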
1. The data folder contains the required protein sequence files:
ls data/
Ara_longgest_protein.fa Pal_ted_proteins.fa
The contents look like this:
head data/Ara_longgest_protein.fa
>AT4G21570
MMDLTKLKPPQITFYCSAFSVLLTLHFTIQLVSQHLFHWKNPKEQKAILIIVLMAPIYAVVSFIGLLEVKGSETFFLFLESIKECYEALVIAKFLALMYSYLNISMSKNILPDGIKGREIHHSFPMTLFQPHVVRLDRHTLKLLKYWTWQFVVIRPVCSTLMIALQLIGFYPSWLSWTFTIIVNFSVSLALYSLVIFYHVFAKELAPHNPLAKFLCIKGIVFFVFWQGIALDILVAMGFIKSHHFWLEVEQIQEAIQNVLVCLEMVIFAAVQKHAYHAGPYSGETKKKLDKKTE
>AT2G05410
MAYPKGINKAHDSFSLFLNVPDNESLPTGWRRHAKVSFSLVNQGSEKLSQRKVTQHWFVQKAFTWGFPVMITHTELNAKMGFLVNGELKVVAKIEVLEVVGKLDVSKESSPIMKTIDVNGFQVLPSQVDSVKRLFEKNLDIVSKFRLKNPYLKTACMNLLLSLTETLCQSPQELSNDGLSDAGVALAYLIETGLKLDWLEKKLDELKEKKKKEESCLVRLREMDEQLQPFKKRCLDIEDQISKEKEELLAAREPLSLYDDIDNNV
>AT5G47770
MSVSCCCRNLGKTIKKAIPSHHLHLRSLGGSLYRRRIQSSSMETDLKSTFLNVYSVLKSDLLHDPSFEFTNESRLWVDRMLDYNVRGGKLNRGLSVVDSFKLLKQGNDLTEQEVFLSCALGWCIEWLQAYFLVLDDIMDNSVTRRGQPCWFRVPQVGMVAINDGILLRNHIHRILKKHFRDKPYYVDLVDLFNEVELQTACGQMIDLITTFEGEKDLAKYSLSIHRRIVQYKTAYYSFYLPVACALLMAGENLENHIDVKNVLVDMGIYFQVQDDYLDCFADPETLGKIGTDIEDFKCSWLVVKALERCSEEQTKILYENYGKPDPSNVAKVKDLYKELDLEGVFMEYESKSYEKLTGAIEGHQSKAIQAVLKSFLAKIYKRQK
2. The results are written to the OrthoFinder folder inside data:
ls data/
Ara_longgest_protein.fa OrthoFinder Pal_ted_proteins.fa
3. The final result is Orthogroups.tsv in the Orthogroups folder:
ls data/OrthoFinder/Results_Oct13/Orthogroups/Orthogroups.tsv
data/OrthoFinder/Results_Oct13/Orthogroups/Orthogroups.tsv
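Orthogroups.tsv has one row per orthogroup: the first column is the orthogroup ID, and each following column lists the genes of one species separated by a comma and a space. The sketch below shows one way to reshape it into a two-column table without spaces after the commas. It is only an illustration under stated assumptions: orthogroups_to_pairs is a hypothetical helper name, the column numbers must be read off the header line of your own file, and rows in which either species has no gene are dropped.

```shell
#!/usr/bin/env bash
# orthogroups_to_pairs FILE QUERY_COL TARGET_COL
# Hypothetical helper. Columns are 1-based and column 1 is the orthogroup ID.
# Prints "<query genes><TAB><target genes>" with genes joined by "," only.
orthogroups_to_pairs() {
    awk -F '\t' -v q="$2" -v t="$3" '
        NR == 1 { next }                      # skip the header row
        {
            sub(/\r$/, "")                    # tolerate Windows line endings
            query = $q
            target = $t
            gsub(/, /, ",", query)            # "a, b" becomes "a,b"
            gsub(/, /, ",", target)
            if (query != "" && target != "")  # keep groups found in both species
                print query "\t" target
        }
    ' "$1"
}
```

For example, if the header shows the Arabidopsis proteome in column 2 and the other species in column 3, then orthogroups_to_pairs Orthogroups.tsv 3 2 prints the other species first and the Arabidopsis IDs second.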
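Post-processing the results
1. Select the orthologs of interest (those inside the red box in the figure) and save them as a tab-delimited txt file. Note that the Arabidopsis IDs go in the second column and that there is no space after each comma.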
A0A024FLK4,Q10MK5 AT5G10920
A0A0N7KC65,A0A0P0UX72,A0A0P0VQW4 AT3G54360
A0A0N7KC82,A0A0P0UXG8 AT5G10630
A0A0N7KC83,A0A0P0UXE2,A0A0P0UXI3,A0A0P0UXP7,A0A0P0UY53,Q0JR80,Q656W4,Q9FTF3
A0A0N7KCI1,A0A0P0UZQ9,Q5QMY9 AT2G25760
A0A0N7KCI8 AT4G18750
A0A0N7KCX5,Q5Z8Q9 AT4G02390
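2. Provide the IDs. For example, I used all rice Kac proteins to find all of their Arabidopsis homologs, and I now want to know which of those homologs are themselves Kac proteins in Arabidopsis. For that I need a list of the Arabidopsis Kac protein IDs, one ID per line, as follows: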
AT1G01050
AT1G01090
AT1G01120
AT1G01300
AT1G01610
AT1G01740
Run the extraction. Results.txt then contains the matched ID pairs, one rice Kac protein and its Arabidopsis Kac ortholog per line.
perl ID_analysis.pl Arabi_pal_protein_IDs.txt Arabi_Rice_Pal_proteins_1.txt > Results.txt
The Perl script is as follows:
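#!/usr/bin/env perl
# Usage: perl ID_analysis.pl <Arabidopsis_ID_list> <ortholog_table> > Results.txt
use strict;
use warnings;

# Read the Arabidopsis IDs of interest (one per line) into a lookup hash.
open my $ids_fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my %wanted;
while (my $line = <$ids_fh>) {
    chomp $line;
    $wanted{$line} = 1 if length $line;
}
close $ids_fh;

# Each table line: comma-separated rice IDs, a tab, comma-separated Arabidopsis IDs.
open my $table_fh, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
while (my $line = <$table_fh>) {
    chomp $line;
    my ($rice_ids, $target_ids) = split /\t/, $line;
    next unless defined $target_ids;    # skip rows without an Arabidopsis ID
    my @rice = split /,/, $rice_ids;
    for my $id (split /,/, $target_ids) {
        next unless $wanted{$id};
        # Print one "rice ID <TAB> Arabidopsis ID" line per rice protein.
        print "$_\t$id\n" for @rice;
    }
}
close $table_fh;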
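As a cross-check on the Perl script, the same lookup can be sketched in awk. This is an illustrative alternative and not the script used above: match_orthologs is a hypothetical helper name, and the sketch assumes the ID list is not empty and that both files use the layouts shown earlier (one ID per line, and a two-column tab-delimited table).

```shell
#!/usr/bin/env bash
# match_orthologs ID_LIST PAIR_TABLE
# Hypothetical helper. Prints "<rice ID><TAB><Arabidopsis ID>" for every
# Arabidopsis ID in PAIR_TABLE that also appears in ID_LIST.
match_orthologs() {
    awk -F '\t' '
        # First file: remember every non-empty ID of interest.
        NR == FNR { if ($0 != "") wanted[$0] = 1; next }
        # Second file: split both columns on commas and report the matches.
        {
            n = split($1, rice, ",")
            m = split($2, target, ",")
            for (j = 1; j <= m; j++)
                if (target[j] in wanted)
                    for (i = 1; i <= n; i++)
                        print rice[i] "\t" target[j]
        }
    ' "$1" "$2"
}
```

Running match_orthologs Arabi_pal_protein_IDs.txt Arabi_Rice_Pal_proteins_1.txt and comparing the sorted output with the sorted Results.txt is a quick way to confirm that both approaches agree.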