This tool is mainly used to remove duplicated sequences from a genome assembly.
See the project's GitHub page for details.
Note
- The software needs > 100 GB of memory
- The software requires a genome size < 4 Gb
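Quick start
Basic idea: build a k-mer frequency table from Illumina reads at >40x coverage, then use that table to remove duplicated sequences from the genome assembly.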
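Before building the table it can help to check that the reads really reach roughly 40x. The sketch below is not part of the tool: the function names fastq_bases, fasta_bases and coverage are hypothetical helpers, and using the assembly size as a stand-in for genome size is an assumption (an assembly that still contains duplicated sequences is larger than the genome, so the estimate will be on the low side).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a rough coverage estimate; not part of the tool.

# Total bases in a gzipped FASTQ: the sequence is line 2 of each 4-line record.
fastq_bases() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

# Total bases in an uncompressed FASTA: every line that is not a header.
fasta_bases() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# Coverage = read bases / genome size, printed with one decimal place.
coverage() {
    awk -v r="$1" -v g="$2" 'BEGIN { if (g <= 0) exit 1; printf "%.1f\n", r / g }'
}
```

For paired-end data, add the results of fastq_bases for reads1.gz and reads2.gz, then pass the sum and the output of fasta_bases assemble.fasta (or an independent genome size estimate) to coverage.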
1. Download the tool directly from GitHub. It requires jellyfish, which can be installed with conda:
conda install -c bioconda jellyfish
2. Data preparation
- assemble.fasta
- reads1.gz
- reads2.gz
3. Build the k-mer frequency table
ls *.gz > fq.lst
perl Bin/Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3 -d Kmer_17
# Parameters:
-m minimum k-mer occurrence count
-i list of FASTQ files
-k k-mer size
-d output directory
-s steps to run, as listed below
1: count k-mer by jellyfish
2: record unique k-mer into .h5 file
3: record unique k-mer into .bit file
4: record all k-mer into .h5 file
5: record all k-mer into .bit file
6: record all k-mer into .bit file with -m set to 0.5 of the peak
7: get the genome size, repeat rate and heterozygosity rate
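The `ls *.gz > fq.lst` command above writes an empty list without complaint if the glob matches nothing useful or the working directory is wrong. A slightly more defensive way to build the list might look like the following sketch; make_fq_list is a hypothetical helper, not something shipped with the tool, and it assumes all read files sit directly in one directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the tool: write one .gz path per line.
make_fq_list() {
    local dir="$1" out="$2"
    # find prints nothing when there are no matches, so no unmatched glob ends up in the list
    find "$dir" -maxdepth 1 -type f -name '*.gz' | sort > "$out"
    # -s is true only if the list exists and is non-empty
    if [ ! -s "$out" ]; then
        echo "no .gz files found in $dir" >&2
        return 1
    fi
}
```

Calling make_fq_list . fq.lst from the data directory produces the same kind of list as the ls command, but returns a non-zero status when no read files are found.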
Note:
# For k=17, we recommend:
perl Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3,5 -d Kmer_17
# For k>17, we recommend:
perl Graph.pl pipe -i fq.lst -m 2 -k 23 -s 1,2,4 -d Kmer_23
#######################################
k=15 is suitable for genomes smaller than 100M.
k=17 is suitable for genomes smaller than 10G.
This version only supports k<=17.
The output of the step above is located at Kmer_17/02.Uinque_bit/kmer_17.bit
4. Remove duplicated sequences from the genome
perl remDup.pl <genome.fa> <outdir> <cutoff:0.7>
Options:
--ref <str> The ref genome to build kbit
--kbit <str> The unique kmer file
--kmer <int> the kmer size [15]
--sort <int> sort seq by length [1]
Run the following command:
perl Bin/remDup.pl --kbit Kmer_17/02.Uinque_bit/kmer_17.bit \
--kmer 17 assemble.fasta Compress 0.3
The result is the compressed file: Compress/trinity.single.fasta.gz
Note:
a. If the compressed file is larger than the estimated genome size, turn down the cutoff value
b. If the compressed file is smaller than the estimated genome size, turn up the cutoff value
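The comparison in notes a and b can be scripted. The sketch below assumes "size" means the total number of bases in the compressed FASTA rather than its size on disk; fasta_gz_bases and suggest_cutoff_change are hypothetical helper names, and the estimated genome size has to come from elsewhere (for example a k-mer based estimate).

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the tool.

# Total bases in a gzipped FASTA: every line that is not a header.
fasta_gz_bases() {
    gzip -dc "$1" | awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }'
}

# Apply notes a and b: compare the output size with the estimated genome size (both in bases).
suggest_cutoff_change() {
    local size="$1" est="$2"
    if [ "$size" -gt "$est" ]; then
        echo "turn down"      # output larger than the estimate (note a)
    elif [ "$size" -lt "$est" ]; then
        echo "turn up"        # output smaller than the estimate (note b)
    else
        echo "keep"
    fi
}
```

For example, suggest_cutoff_change "$(fasta_gz_bases Compress/trinity.single.fasta.gz)" 500000000 prints which way to move the cutoff for a hypothetical 500 Mb estimate.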
Other tools
- Purge_dups
- purge_haplotigs