首先介绍一下megan中的blast2lca工具,这里主要作用是使用NCBI-blast或diamond与NCBI-nr库进行比对后,将比对结果通过lca算法运行,得到最大可能的物种注释结果(由于blast2lca在设计时没有多线程运行的功能,所以一般考虑切分文件之后并行运行)
lca的结果:
seq_00000001; ;d__Bacteria; 100;p__Proteobacteria; 92;c__Gammaproteobacteria; 85;o__Oceanospirillales; 31;f__Halomonadaceae; 15;g__Salinicola; 15;s__Salinicola sp. L3; 8;
seq_00000002; ;d__Bacteria; 100;p__Chloroflexi; 67;c__Anaerolineae; 33;o__unknown; 11;f__unknown; 11;g__unknown; 11;s__Anaerolineae bacterium; 11;
seq_00000003; ;d__Archaea; 100;p__Candidatus Bathyarchaeota; 100;c__unknown; 50;o__unknown; 50;f__unknown; 50;g__unknown; 50;s__miscellaneous Crenarchaeota group-1 archaeon SG8-32-3; 50;
seq_00000004; ;d__Bacteria; 100;p__Proteobacteria; 100;c__Deltaproteobacteria; 100;o__Desulfobacterales; 50;f__Desulfobacteraceae; 50;g__unknown; 50;s__Desulfobacteraceae bacterium; 50;
seq_00000005; ;d__Bacteria; 100;p__Proteobacteria; 100;c__Deltaproteobacteria; 100;o__unknown; 100;f__unknown; 100;g__unknown; 100;s__Olavius sp. associated proteobacterium Delta 1; 100;
seq_00000006; ;d__Bacteria; 100;p__Chloroflexi; 100;c__Anaerolineae; 100;o__unknown; 100;f__unknown; 100;g__unknown; 100;s__Anaerolineae bacterium SG8_19; 100;
seq_00000007; ;d__Archaea; 100;p__Candidatus Thermoplasmatota; 100;c__Thermoplasmata; 100;o__unknown; 67;f__unknown; 67;g__unknown; 67;s__Thermoplasmata archaeon; 67;
seq_00000008; ;
seq_00000009; ;d__Bacteria; 100;p__Proteobacteria; 70;c__Deltaproteobacteria; 40;o__unknown; 30;f__unknown; 30;g__unknown; 30;s__Deltaproteobacteria bacterium; 30;
seq_00000010; ;d__Bacteria; 100;p__Proteobacteria; 100;c__Deltaproteobacteria; 100;o__Desulfobacterales; 50;f__Desulfobacteraceae; 50;g__unknown; 50;s__Desulfobacteraceae bacterium; 50;
观察后发显以 " ; " 为分隔符,并且lca打分值在统计时可以删除
这里需要基因丰度(注意文中的物种结果和基因丰度不匹配,无法直接使用,读者最好以自己的数据运行):
Name sample1 sample2 sample3 sampleR1 sampleR2 sampleR3 sampleT1 sampleT2 sampleT3 sampleM1 sampleM2 sampleM3 sampleA1 sampleA2 sampleA3 sampleB1 sampleB2 sampleB3 sampleSPA1 sampleSPA2 sampleSPA3 sampleSPB1 sampleSPB2 sampleSPB3 sampleSPC1 sampleSPC2 sampleSPC3 samplePA1 samplePA2 samplePA3 samplePB1 samplePB2 samplePB3 samplePC1 samplePC2 samplePC3
seq_11702791 0 0.096039 0.075002 0 0.128542 0.809282 0 0.084837 0 00.498057 0 0 0 0.229096 0 0 0 0.085074 0 0.357043 0 00 0 0.282702 0 0.254728 0.233158 0.255244 0.36337 0 0 0.389254 0.200327 0.233231
seq_11702784 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0
seq_11702781 0 0.214329 0 0 0 0 0 0 0 0 0 0 0.135536 0 0.129419 0 0.151596 0 0 0.084597 0 0 0 0 0 00 0 0 0 0.411339 0.119677 0.192297 0 0 0
seq_11702777 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0
seq_11702774 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0.126775 0 0 0 0 0 0 0 0 0 0 0.49159 0 00.222713 0 0 0 0
seq_11702773 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0.329523 0 0 0 0 0 0 0 0 0 0.456553 1.00284 0 0.793967 0 0 0 0
seq_11702771 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0.36227 0.172113 0 0 0.421298 0 0 0 0 0.336708 0.667988 0.116559 0.101392 0.651608 0 0 0.112201
seq_11702770 0 0 0 0 0.05274 0 0 0.214383 0.072671 0.1093 0 0.112557 0.09828 0 0 0 0 0 0.070841 0 0.148879 0 0.293738 0.273107 0.078943 0 0.146785 0 0 0 0 0.260459 0 0.720536 0.165462 0
seq_11702766 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0.132232 0.093209 0 0 0 0 0 0 0 0 0 0.640410.702353 0 0 0 0 0 0
拿到这两个数据以后,考虑统计界门纲目科属种在每个样本中的相对分度之和,找到在每个样本中相对丰度最高的Top30物种,用python处理以下这两个表格(注意这里我修改列名的方式只适合我的文件结构,要根据自己的表格结果自行调整):
#!/usr/bin/python
import pandas as pd
import os
def get_abun_sum (df):
name=df.iloc[:,0].unique().tolist()
num_name=len(name)
result={}
for i in range(0,num_name):
tem_df=df[df.iloc[:,0]==name[i]]
abunsum=tem_df.iloc[:,1].sum()
result[name[i]] = abunsum
return result
def get_df(into):
os.makedirs("result")
a=into.columns
for i in range(1,8):
for x in range(9,45) :
df=into[[a[i],a[x]]]
df2=get_abun_sum(df)
df3=pd.DataFrame([df2]).T
filename='./result/'+str(a[i])+"_"+str(a[x])+'.csv'
df3.to_csv(filename,header=0)
print("Finish "+filename)
def get_data(lcafile,abunfile):
lca=pd.read_table(lcafile,sep=";",header=None,on_bad_lines='skip',low_memory=False)
lca=lca.drop(lca.columns[[1,3,5,7,9,11,13,15,16]],axis=1)
lca=lca[lca.iloc[:,1].notnull()]
lca.columns=["geneid","Kingdom","Phylum","Class","Order","Family","Genus","Species"]
#lca.to_csv("lca_filter.csv",sep=",",header=0,index=0)
abun=pd.read_table(abunfile)
df=pd.merge(lca,abun,left_on="geneid",right_on="Name",how="inner")
return df
def main():
into = get_data("lca_result","gene_abundance.TPM")
#have a try !
get_df(into)
main()
结果输出到了运行目录下result的文件夹中,这里对lca不好的结果进行了删除,删除了约4.5%的数据
画图以后有空搞