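BUSCO is short for Benchmarking Universal Single-Copy Orthologs. It is an open-source Python tool that assesses the completeness of genome assemblies and annotations based on gene evolution, that is, by comparing against reference sets of orthologs.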
Literature:
Article: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015
Citations: 4695
Book: BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods in Molecular Biology 2019
Abstract:
Motivation: Genomics has revolutionized biological research, but quality assessment of the resulting assembled sequences is complicated and remains mostly limited to technical measures like N50.
Results: We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. We implemented the assessment procedure in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs, named BUSCO.
There are few methods for assessing genome assemblies; BUSCO is open source and easy to use.
Methods:
Website: https://busco.ezlab.org/
MANUAL: https://busco.ezlab.org/busco_userguide.html
Installing with conda:
conda:https://anaconda.org/bioconda/busco
Any one of the following will do; they will probably install v4.1.2.
conda install -c bioconda busco
conda install -c bioconda/label/broken busco
conda install -c bioconda/label/cf201901 busco
To install the latest version (v5.1.2) from bioconda, see the manual.
# If the channel is not configured yet, add it
conda config --show
conda config --add channels conda-forge
# Install with conda
conda create -n busco
conda activate busco
conda install -c bioconda -c conda-forge busco=5.1.2
busco --help
busco --version
# BUSCO 5.1.2
Databases:
Many other tutorials download the plant dataset, but that link no longer seems to work?
# BUSCO database for plants
wget -c https://busco.ezlab.org/datasets/embryophyta_odb9.tar.gz
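orthodb: https://www.orthodb.org/?page=filelist seems to hold a lot of data?
The manual gives the source for the lineage datasets:
https://busco-data.ezlab.org/v5/data/ , which shows:
This is the latest v5 database, so this is the right place.
https://busco-data.ezlab.org/v5/data/lineages/ , which shows:
Updated as recently as this month (2021), with lineages for all kinds of organisms to choose from. Download bacteria_odb10 and take a look: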
wget -c https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
tar -zxvf bacteria_odb10.2020-03-06.tar.gz
cd bacteria_odb10
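A quick sanity check after extracting the dataset could look like the sketch below. It assumes the extracted folder contains an hmms/ subdirectory with one .hmm profile per BUSCO group; count_hmms is a made-up helper name, not part of BUSCO.

```shell
# Count the HMM profiles in an extracted lineage dataset.
# Assumption: profiles live in <dataset>/hmms/ as *.hmm files.
count_hmms() {
    find "$1/hmms" -type f -name '*.hmm' | wc -l | tr -d ' '
}

# Example: count_hmms /database/BUSCO/bacteria_odb10
```

If that layout assumption holds, the number printed should match the "number of BUSCOs" that BUSCO reports for the dataset.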
Using BUSCO:
Automated lineage selection mode, from the manual
busco -m MODE -i INPUT -o OUTPUT --auto-lineage
busco -m MODE -i INPUT -o OUTPUT --auto-lineage-prok
# or ignoring eukaryotes to save runtime, if compatible with your experimental goal.
busco -m MODE -i INPUT -o OUTPUT --auto-lineage-euk
# or ignoring non-eukaryotes to save runtime, if compatible with your experimental goal.
Targeted lineage mode, as recommended by the manual
db_busco="/database/BUSCO/bacteria_odb10"
busco --in AF04-12.fna \
--lineage_dataset $db_busco \
--out ./output/ \
-m genome --offline
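This fails with an error:
As the error indicates, the --out value must not contain a slash. Allowing one would mean changing the configuration file, which is best avoided to be safe. Removing the slash makes the run work. For batch runs, simply cd into a freshly created folder for each sample and back out again.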
busco --in AF04-12.fna \
--lineage_dataset $db_busco \
--out output \
-m genome --offline
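The batch idea mentioned above (one folder per sample, a plain --out name without a slash) might be sketched as follows. run_name, run_busco_batch and the BUSCO_CMD variable are names made up for illustration; BUSCO_CMD only exists so the loop can be dry-run with echo. The busco options are the same ones used in the command above.

```shell
# Derive a slash-free sample name from a genome path,
# e.g. /data/AF04-12.fna -> AF04-12
run_name() {
    base=$(basename "$1")
    printf '%s\n' "${base%.*}"
}

# Run BUSCO once per genome, each inside its own working folder.
# Usage: run_busco_batch <absolute lineage path> <work folder> <genome>...
run_busco_batch() {
    db=$1
    workdir=$2
    shift 2
    for genome in "$@"; do
        name=$(run_name "$genome")
        # Resolve the genome to an absolute path before changing directory
        abs=$(cd "$(dirname "$genome")" && pwd)/$(basename "$genome")
        mkdir -p "$workdir/$name"
        (
            # The subshell keeps the cd from leaking into the next iteration
            cd "$workdir/$name" || exit 1
            ${BUSCO_CMD:-busco} --in "$abs" --lineage_dataset "$db" \
                --out output -m genome --offline
        )
    done
}

# Dry run that only prints the commands:
# BUSCO_CMD=echo run_busco_batch /database/BUSCO/bacteria_odb10 BUSCO/illumina *.fna
```

The lineage path should be absolute here, because each run starts from a different folder.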
Results:
full_table.tsv
# BUSCO version is: 5.1.2
# The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124)
# Busco id	Status	Sequence	Gene Start	Gene End	Strand	Score	Length	OrthoDB url	Description
4421at2	Complete	AF04-12.Scaf40_36	46725	51011	+	1675.3	1205	https://www.orthodb.org/v10?query=4421at2	DNA-directed RNA polymerase subunit beta'
9601at2	Complete	AF04-12.Scaf40_35	42874	46686	+	1169.7	804	https://www.orthodb.org/v10?query=9601at2	DNA-directed RNA polymerase subunit beta
26038at2	Complete	AF04-12.Scaf8_42	54773	58477	+	212.5	371	https://www.orthodb.org/v10?query=26038at2	phosphoribosylformylglycinamidine synthase
91428at2	Complete	AF04-12.Scaf45_20	22437	25052	+	540.6	530	https://www.orthodb.org/v10?query=91428at2	alanine--tRNA ligase
95696at2	Complete	AF04-12.Scaf4_63	73584	75617	+	714.7	504	https://www.orthodb.org/v10?query=95696at2	excinuclease ABC subunit B
143460at2	Complete	AF04-12.Scaf1_51	58613	60415	+	512.5	441	https://www.orthodb.org/v10?query=143460at2	GTP-binding protein
182107at2	Complete	AF04-12.Scaf17_16	11979	13760	+	709.2	491	https://www.orthodb.org/v10?query=182107at2	elongation factor 4
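To get a quick tally from full_table.tsv without opening the summary, something like the following sketch could be used. count_status is a made-up helper name. It assumes the file is tab-separated with the BUSCO id in column 1 and the status in column 2, that comment lines start with "#", and that a Duplicated BUSCO appears on one row per copy (hence each id is counted only once).

```shell
# Count BUSCO ids per status in a full_table.tsv.
# Comment lines are skipped; repeated ids (duplicated copies) count once.
count_status() {
    awk -F'\t' '!/^#/ && NF >= 2 && !seen[$1]++ { n[$2]++ }
                END { for (s in n) print s "\t" n[s] }' "$1" | sort
}

# Example: count_status output/run_bacteria_odb10/full_table.tsv
```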
missing_busco_list.tsv
POG091H008J
POG091H00BL
POG091H00TK
............... (for this run the list is actually empty, since 0 BUSCOs are missing)
short_summary.txt
# BUSCO version is: 5.1.2
# The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124)
# Summarized benchmarking in BUSCO notation for file /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/analysis/platform/test/AF04-12.fna
# BUSCO was run in mode: genome
# Gene predictor used: prodigal
***** Results: *****
C:100.0%[S:97.6%,D:2.4%],F:0.0%,M:0.0%,n:124
124 Complete BUSCOs (C)
121 Complete and single-copy BUSCOs (S)
3 Complete and duplicated BUSCOs (D)
0 Fragmented BUSCOs (F)
0 Missing BUSCOs (M)
124 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.1
prodigal: 2.6.3
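The percentages in the one-line notation are just the counts divided by n, so with the numbers above S = 121/124 and D = 3/124. A small sketch for checking or re-using them in scripts follows; busco_pct and parse_busco_line are made-up helper names.

```shell
# Percentage of BUSCOs in one category, rounded to one decimal place.
busco_pct() {
    awk -v c="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * c / n }'
}

# Split the one-line BUSCO notation into space-separated C S D F M n values.
parse_busco_line() {
    printf '%s\n' "$1" | sed -E \
        's/^[[:space:]]*C:([0-9.]+)%\[S:([0-9.]+)%,D:([0-9.]+)%\],F:([0-9.]+)%,M:([0-9.]+)%,n:([0-9]+).*$/\1 \2 \3 \4 \5 \6/'
}

busco_pct 121 124   # prints 97.6
busco_pct 3 124     # prints 2.4
parse_busco_line 'C:100.0%[S:97.6%,D:2.4%],F:0.0%,M:0.0%,n:124'
# prints 100.0 97.6 2.4 0.0 0.0 124
```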
Merging BUSCO results:
## BUSCO result statistics
task="illumina"
out="BUSCO/${task}_busco.txt"
# Write the header with ">" so that re-running the script does not append a second header
echo -e "id\tc\ts\td\tf\tm" > "$out"
# Read one strain id per line from the list file
while read -r i; do
    summary="BUSCO/$task/$i/run_bacteria_odb10/short_summary.txt"
    # The count is the first field of each matching summary line
    c=$(grep "Complete BUSCOs" "$summary" | awk '{print $1}')
    s=$(grep "Complete and single-copy BUSCOs" "$summary" | awk '{print $1}')
    d=$(grep "Complete and duplicated BUSCOs" "$summary" | awk '{print $1}')
    f=$(grep "Fragmented BUSCOs" "$summary" | awk '{print $1}')
    m=$(grep "Missing BUSCOs" "$summary" | awk '{print $1}')
    echo -e "$i\t$c\t$s\t$d\t$f\t$m" >> "$out"
    echo -e "\033[32m $i done... \033[0m"
done < 76_strain_id.list
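Visualization:
This needs a plotting script (generate_plot.py). The official site does it as shown here; you could also put something together yourself. I did not do this step myself.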
cp XX1/short_summary.*.lineage_odb10.XX1.txt BUSCO_summaries/.
cp XX2/short_summary.*.lineage_odb10.XX2.txt BUSCO_summaries/.
cp XX3/short_summary.*.lineage_odb10.XX3.txt BUSCO_summaries/.
python3 scripts/generate_plot.py -wd BUSCO_summaries
python3 scripts/generate_plot.py -wd /full/path/to/my/folder/BUSCO_summaries
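With many runs, the cp lines above get tedious. A loop along these lines could gather the summaries instead; this is an untested sketch. collect_summaries is a made-up helper name. It assumes each run folder contains run_<lineage>/short_summary.txt (the layout used in the merging script above), and it assumes that generate_plot.py accepts files named short_summary.specific.<lineage>.<label>.txt, so check that pattern against the user guide first.

```shell
# Copy each run's summary into one folder under a unique, label-bearing name.
# Usage: collect_summaries <dest folder> <lineage name> <run folder>...
collect_summaries() {
    dest=$1
    lineage=$2
    shift 2
    mkdir -p "$dest"
    for run_dir in "$@"; do
        # The run folder name doubles as the label in the file name
        label=$(basename "$run_dir")
        src="$run_dir/run_$lineage/short_summary.txt"
        if [ -f "$src" ]; then
            cp "$src" "$dest/short_summary.specific.$lineage.$label.txt"
        else
            echo "no summary found for $label" >&2
        fi
    done
}

# Example: collect_summaries BUSCO_summaries bacteria_odb10 BUSCO/illumina/*
```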
More:
BUSCO - assembly quality assessment