把read比对到基因组之后,需要提取唯一比对来进行下一步分析。
bowtie2和HISAT2 都没有只输出唯一比对的命令,所以需要对比对结果sam文件进行提取,可以通过sam文件最后一列的标签进行筛选。
Bowtie2
bowtie2的输出sam文件的标签:
AS:i:<N>
:Alignment score. Can be negative. Can be greater than 0 in [--local
] mode (but not in [--end-to-end
] mode). Only present if SAM record is for an aligned read.(引用自bowtie2 manual:http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#sam-output)
表示比对分数,只有比对上的reads才会有这个标签。
所以我们先用grep “AS:” aligned.sam
将所有比对上的reads提取出来。
XS:i:<N>
: Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Can be greater than 0 in [--local
] mode (but not in [--end-to-end
]mode). Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than [AS:i
] (引用自bowtie2 manual:http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#sam-output)
XS:i:<N>
标签是在多个比对时才会出现(multi mapping reads)。
所以用grep -v "XS:i"
命令即可实现将不包含XS:i
标签的例子全部提取出来。
bowtie2提取唯一比对的命令总结
grep "AS:" aligned.sam | grep –v "XS:" > unique_alignments.sam
HISAT2
HISAT2输出sam文件的标签:
ZS:i:<N>
: Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater thanAS:i
.
NH:i:<N>
: The number of mapped locations for the read or the pair.
(引用自HISAT2 manual: http://daehwankimlab.github.io/hisat2/manual/)
ZS:i:<N>
仅当SAM记录用于已对齐的读取且已为该读取找到多个对齐时才出现。
NH:i:<N>
显示read 比对到基因组位置的个数<N>。
HISAT2提取唯一比对的命令总结
grep "NH:i:1" aligned.sam | grep -v "ZS:i" > unique_alignments.sam
进行提取后,一定要用wc -l unique_alignments.sam
来查看read的个数,用来确定和比对软件输出的log文件中记录的个数是否一致。
wc -l unique_alignments.sam
42731387 unique_alignments.sam
比对软件输出的log文件结果:
62624861 reads; of these:
62624861 (100.00%) were unpaired; of these:
7315939 (11.68%) aligned 0 times
42731387 (68.23%) aligned exactly 1 time
12577535 (20.08%) aligned >1 times
88.32% overall alignment rate
aligned exactly 1 time
表示唯一比对。
在网上查了很多方法,进行HISAT2比对时,很多人建议只需要用grep "NH:i:1"
就可以筛选出唯一比对,但是筛选出的read个数和输出的log文件显示的个数不相符。也看到其他人提问过这个问题,但是没有人解答,所以分享出来。
Bowtie1
只需要在mapping时 加入 -m 1
,就可以直接输出所有 唯一比对的read。