最近需要尽可能的下载细菌基因组,ncbi ftp站点(ftp://ftp.ncbi.nlm.nih.gov/genomes)里面"all"这个目录可以下载,不过"all"目录还有古细菌之类的哈。如果只想取Refseq或genbank数据库其一,可以通过以下方式:
#下载Refseq/genbank数据库中的细菌基因组数据
#截止20190106,有140681条基因组数据
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
#文件第12列是拼接状态(Complete Genome/Scaffold/contig),第20列是ftp路径,下面是下载组装完整的基因组信息
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gbff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths
cut -d / -f 11 ftpfilepaths | paste - ftpfilepaths | while read a b;do echo "wget -c -nd -r -np -k -L -p -nd -P genbank $b && gzip -d genbank/$a";done >run.download.sh
#脚本只下载了gbff文件,当然,如果想要gff,fna,faa通过小脚本转就可以了,或者加上“-A faa.gz,fna.gz,gff.gz,gpff.gz”
真的超级占存储空间,我只下载gbff文件,需要其他格式的文件再简单格式转化就可以了。同时,下载一批立马处理一批。
Refseq和genbank的区别
Refseq下载的是GCF开头的文件,而genbank下载的是GCA开头的文件。
具体看这里:https://zhuanlan.zhihu.com/p/20749737
GenBank是核苷酸数据库,RefSeq是基因数据库 ,具体哪个更全?互为补充,以前做分析的时候发现有些细菌基因组在GenBank里面没有在RefSeq有,还有可能两个数据库里面某个基因组只有一个fna文件没有注释,这种情况如果遇到自己动手用prokka十分钟就能解决。
生信小脚本收藏癖,有时遇到比较实用的脚本会存下来放在自己环境变量bin里面,积累会使以后的分析变得更加高效
附上两个网上看到好用的脚本
1.GenBank转faa格式
#!/usr/bin/env perl
# This program reads gbk files and extracts all amino acid sequences from the
# /translation fields into a FASTA file. The FASTA header contains a sequential
# number followed by the taxon id, which is extracted from the
# /db_xref="taxon:<ID>" field. Only letters in the 20 letter amino acid
# alphabet are retained in the FASTA file.
#
# author: Peter Menzel
#
# This file is part of Kaiju, Copyright 2015-2017 Peter Menzel and Anders Krogh
# Kaiju is licensed under the GPLv3, see the file LICENSE.
#
use strict;
use warnings;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError) ;
my $t = 0;
my $taxid;
my $protein_id;
if(!defined $ARGV[1]) { die "Usage: $0 infile.gbk outfile.faa"; }
open(OUT,">",$ARGV[1]) or die "Could not open file $ARGV[1] for writing.";
my $in_fh = new IO::Uncompress::AnyUncompress $ARGV[0] or die "Opening input file failed: $AnyUncompressError\n";
while(<$in_fh>) {
chomp;
if(m,/db_xref="taxon:(\d+)",) {
$taxid = $1;
}
elsif(m,/protein_id="([^"]+)",) {
$protein_id=$1;
}
elsif(m,\s+/translation="([^"]+)",) {
if(!defined($taxid)) { die "No taxon id found in gbk file $ARGV\n";}
print OUT ">$protein_id\_$taxid\n";
my $seq = $1;
$seq =~ tr/BZ/DE/; # a.a. alphabet specifies `B' matches `N' or `D', and `Z' matches `Q' or `E.', here we use substitution with higher score
$seq =~ s/[^ARNDCQEGHILKMFPSTWYV]//gi;
print OUT "$seq\n";
}
elsif(m,\s+/translation="([^"]+)$,) {
if(!defined($taxid)) { die "No taxon id found in gbk file $ARGV\n";}
print OUT ">$protein_id\_$taxid\n";
$t = 1;
my $seq = $1;
$seq =~ tr/BZ/DE/;
$seq =~ s/[^ARNDCQEGHILKMFPSTWYV]//gi;
print OUT "$seq\n";
}
elsif($t) {
if(m,",) {
tr/BZ/DE/;
s/[^ARNDCQEGHILKMFPSTWYV]//gi;
print OUT $_,"\n";
$t = 0;
}
else {
tr/BZ/DE/;
s/[^ARNDCQEGHILKMFPSTWYV]//gi;
print OUT $_,"\n";
}
}
}
close($in_fh);
close(OUT);
2.Genbank转fna格式
#!/usr/bin/env python
import sys
from Bio import SeqIO
if len(sys.argv) < 3:
print('USAGE: gbk2fna GBK FNA')
sys.exit(65)
SeqIO.write(SeqIO.parse(sys.argv[1], 'genbank'), sys.argv[2], 'fasta')