BLAST/Diamond XML result file format:
Script: blastxml_to_tabular.py
URL: https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/blastxml_to_tabular.py
Column descriptions:
====== ========= ============================================
Column NCBI name Description
------ --------- --------------------------------------------
     1 qseqid    Query Seq-id (ID of your sequence)
     2 sseqid    Subject Seq-id (ID of the database hit)
     3 pident    Percentage of identical matches
     4 length    Alignment length
     5 mismatch  Number of mismatches
     6 gapopen   Number of gap openings
     7 qstart    Start of alignment in query
     8 qend      End of alignment in query
     9 sstart    Start of alignment in subject (database hit)
    10 send      End of alignment in subject (database hit)
    11 evalue    Expectation value (E-value)
    12 bitscore  Bit score
====== ========= ============================================
The additional columns offered in the Galaxy BLAST+ wrappers are:
====== ============= ===========================================
Column NCBI name     Description
------ ------------- -------------------------------------------
    13 sallseqid     All subject Seq-id(s), separated by ';'
    14 score         Raw score
    15 nident        Number of identical matches
    16 positive      Number of positive-scoring matches
    17 gaps          Total number of gaps
    18 ppos          Percentage of positive-scoring matches
    19 qframe        Query frame
    20 sframe        Subject frame
    21 qseq          Aligned part of query sequence
    22 sseq          Aligned part of subject sequence
    23 qlen          Query sequence length
    24 slen          Subject sequence length
    25 salltitles    All subject titles, separated by '<>'
====== ============= ===========================================
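The column names only become useful once they are attached to the fields of each output line. Below is a minimal sketch of reading the tabular output into dictionaries keyed by column name, assuming the file contains the 12 standard tab-separated columns. `STD_COLUMNS`, `parse_hits`, the sample line, and the E-value cutoff are illustrative and not part of blastxml_to_tabular.py.

```python
import csv
import io

# The 12 standard column names, in the order listed in the first table.
STD_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def parse_hits(handle, columns=STD_COLUMNS):
    """Yield one dict per hit, keyed by column name (hypothetical helper)."""
    # QUOTE_NONE keeps any quote characters in the fields untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(row)}")
        yield dict(zip(columns, row))

# Made-up sample line, for illustration only.
sample = "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t365\n"
hits = list(parse_hits(io.StringIO(sample)))

# All values are read as strings, so convert before comparing numerically.
significant = [h for h in hits if float(h["evalue"]) <= 1e-5]
print(significant[0]["sseqid"], significant[0]["pident"])
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, and supply the matching column list if the file was not produced with the standard 12 columns.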
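By default 12 columns are written; up to 25 are available, and you can also pick columns freely. The commands are:
# Output the 12 standard columns
python blastxml_to_tabular.py -o output.txt -c std input.xml
# Output all 25 columns
python blastxml_to_tabular.py -o output.txt -c ext input.xml
# Custom column selection
python blastxml_to_tabular.py -o output.txt -c 'qseqid,sseqid,pident' input.xml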
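When the custom column list is assembled by a script, it can help to validate the names before calling the converter. The sketch below only builds the command line, using the `-o` and `-c` options shown above; `ALL_COLUMNS` and `build_command` are hypothetical names, and actually running the command (for example with `subprocess.run`) is left to the reader.

```python
import shlex

# The 25 column names from the two tables, in column order.
ALL_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore",
               "sallseqid", "score", "nident", "positive", "gaps", "ppos",
               "qframe", "sframe", "qseq", "sseq", "qlen", "slen", "salltitles"]

def build_command(xml_path, out_path, columns):
    """Return the argument list for a custom-column conversion (hypothetical helper)."""
    unknown = [c for c in columns if c not in ALL_COLUMNS]
    if unknown:
        raise ValueError("unknown column name(s): " + ", ".join(unknown))
    return ["python", "blastxml_to_tabular.py",
            "-o", out_path, "-c", ",".join(columns), xml_path]

cmd = build_command("input.xml", "output.txt", ["qseqid", "sseqid", "pident"])
print(shlex.join(cmd))  # shell-quoted form of the argument list
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell quoting issues with the comma-separated column string.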
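Note: the result file has no header row (that is, the column names listed above are not included); add one yourself if you need it.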
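One way to add the header is to copy the file with a line of column names written first. This is a minimal sketch, assuming the output has the 12 standard columns; `add_header` and the file names are hypothetical, and the sample line is made up for the demonstration.

```python
import os
import tempfile

# The 12 standard column names, in output order.
STD_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def add_header(in_path, out_path, columns=STD_COLUMNS):
    """Copy a headerless tab-separated file to out_path, header line first."""
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        dst.write("\t".join(columns) + "\n")
        for line in src:
            dst.write(line)

# Demonstration on a made-up one-line result in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "output.txt")
    with_header = os.path.join(tmp, "output_with_header.txt")
    with open(raw, "w") as fh:
        fh.write("query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t365\n")
    add_header(raw, with_header)
    with open(with_header) as fh:
        result_lines = fh.read().splitlines()

print(result_lines[0])
```

If the file was produced with a different column selection, pass the matching list as `columns` so that the header lines up with the data.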
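If you simply rename the output file to .csv, every row ends up crammed into a single cell, because the fields are separated by tabs rather than commas. If you need CSV, convert the file.
The following script, txt_to_csv.py, can serve as a reference: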
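import csv

input_filename = input("Enter input file name: ")
output_filename = input("Enter output file name: ")

# Read the tab-separated text file into a list of rows, skipping blank lines
with open(input_filename, 'r') as file:
    rows = [line.rstrip('\n').split('\t') for line in file if line.strip()]

# Column names for the full 25-column (ext) output; std output uses the first 12
col_names = ['qseqid','sseqid','pident','length','mismatch','gapopen','qstart','qend','sstart','send','evalue','bitscore','sallseqid','score','nident','positive','gaps','ppos','qframe','sframe','qseq','sseq','qlen','slen','salltitles']

# Use only as many column names as the data has columns
# (for a custom -c selection, edit col_names to match your columns)
n_cols = len(rows[0]) if rows else len(col_names)
data = [col_names[:n_cols]] + rows

# Write the header and the data to the CSV file
with open(output_filename, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)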
Run it directly and enter the two file names when prompted:
python txt_to_csv.py
Finally: this toolkit is also available online (https://usegalaxy.org/); type "xml" into the tools search box on the left to find it.