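Update: I recently came across another tool for detecting contamination in high-throughput sequencing data, fastq_screen (https://github.com/StevenWingett/FastQ-Screen). Compared with the tool described here, it is much faster and easier to use!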
1.Conterminator is a tool designed to detect contamination in nucleotide and protein sequence sets
2.Install
There are three ways to install it: a static build for SSE4.1 CPUs, a static build for AVX2 CPUs, or conda.
# SSE4.1
wget https://mmseqs.com/conterminator/conterminator-linux-sse41.tar.gz; tar xvfz conterminator-linux-sse41.tar.gz; export PATH=$(pwd)/conterminator/:$PATH
# AVX2
wget https://mmseqs.com/conterminator/conterminator-linux-avx2.tar.gz; tar xvfz conterminator-linux-avx2.tar.gz; export PATH=$(pwd)/conterminator/:$PATH
# conda
conda install -c bioconda conterminator
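If it is unclear which static build fits the machine, the CPU flags can be inspected first. The following is a minimal sketch, assuming a Linux system that exposes /proc/cpuinfo; pick_build is a hypothetical helper written for this note, not something shipped with conterminator.

```shell
# pick_build is a hypothetical helper (not part of conterminator):
# given a space-separated list of CPU flags, print which static build to download.
pick_build() {
    case " $1 " in
        *" avx2 "*)   echo "avx2" ;;         # AVX2 build
        *" sse4_1 "*) echo "sse41" ;;        # SSE4.1 build
        *)            echo "unsupported" ;;  # neither flag found
    esac
}

# Assumption: Linux lists the flags on the first "flags" line of /proc/cpuinfo.
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
echo "suggested build: $(pick_build "$flags")"
```

When the result is "unsupported", or the system is not Linux, the conda package is probably the simpler route.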
3.Getting started
Conterminator requires two input files:
(1) a FASTA file containing all sequences (example/dna.fna or example/prots.faa) and
(2) a mapping file (example/dna.mapping or example/prots.mapping), which maps FASTA identifiers to NCBI taxon identifiers. The program writes two output files whose names start with the given prefix (${RESULT_PREFIX}).
example:
To process nucleotide sequences
conterminator dna example/dna.fna example/dna.mapping ${RESULT_PREFIX} tmp
To process protein sequences
conterminator protein example/prots.faa example/prots.mapping ${RESULT_PREFIX} tmp
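The two commands differ only in the module name (dna or protein) and the input files, and ${RESULT_PREFIX} has to be set beforehand. As a rough sketch, the command line could be assembled like this; mode_for is a hypothetical helper, the value "result" is an arbitrary prefix chosen for illustration, and the extension rule simply follows the example file names above.

```shell
# mode_for is a hypothetical helper: choose the conterminator module
# from the file extension used in the examples (.fna = DNA, .faa = protein).
mode_for() {
    case "$1" in
        *.fna) echo "dna" ;;
        *.faa) echo "protein" ;;
        *)     echo "unknown" ;;
    esac
}

RESULT_PREFIX=result          # assumed value, pick your own prefix
input=example/dna.fna         # example file named in the text above

# Print the command instead of running it, so it can be checked first.
echo "conterminator $(mode_for "$input") $input ${input%.*}.mapping $RESULT_PREFIX tmp"
```

The sketch only prints the command; remove the echo and the quotes around the command to actually run it.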
4.Mapping file
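Conterminator needs a mapping file, which assigns each FASTA identifier to a taxonomic identifier.
Here we use the NCBI NT/NR BLAST databases as the example. The commands below dump the sequences and the accession-to-taxid mapping from nt with blastdbcmd, then run conterminator on them: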
blastdbcmd -db nt -entry all > nt.fna
blastdbcmd -db nt -entry all -outfmt "%a %T" > nt.fna.taxidmapping
conterminator dna nt.fna nt.fna.taxidmapping nt.result tmp
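Since a run on nt takes a long time, it may be worth checking first that every FASTA identifier really has an entry in the mapping file. Below is a minimal sketch of such a check; missing_ids is a hypothetical helper, and it assumes the identifier is the first word after ">" in each header and the first whitespace-separated column of the mapping file (as produced by the "%a %T" format above).

```shell
# missing_ids is a hypothetical helper (not part of conterminator).
# Usage: missing_ids <fasta> <mapping>
# Prints every FASTA identifier that has no line in the mapping file.
missing_ids() {
    awk -v map="$2" '
        BEGIN {
            # Load the first column of the mapping file into a lookup table.
            while ((getline line < map) > 0) {
                split(line, f, /[ \t]+/)
                seen[f[1]] = 1
            }
        }
        /^>/ {
            id = substr($1, 2)          # drop the leading ">"
            if (!(id in seen)) print id
        }
    ' "$1"
}
```

For example, `missing_ids nt.fna nt.fna.taxidmapping | head` should print nothing if the two files agree.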