http://www.cnblogs.com/leezx/p/6105646.html
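A clipped alignment means that the alignment does not use the full sequence of the read: bases at one or both ends of the read are cut off (clipped or trimmed). The example below shows a clipped alignment.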
Alignment:
Read:      ACGGTTGCGTTAA-TCCGCCACG
               ||||||||| ||||||
Reference: TAACTTGCGTTAAATCCGCCTGG
The counterpart of a clipped alignment is a spliced alignment, in which the middle of the read is not aligned while its two ends are. It looks like this:
Alignment:
Read:      ACGGTTGCGTTAAGCTCATCCGCCACG
           |||||||||||||     |||||||||
Reference: ACGGTTGCGTTAA.....TCCGCCACG
A clipped alignment is represented in CIGAR by one of two operations: S (soft clip) or H (hard clip).
BWA notes: If the read has a chimeric alignment, the paired or the top hit uses soft clipping and is marked with neither 0x800 nor 0x100 bits. All the other hits part of the chimeric alignment will use hard clipping and be marked with 0x800 if option “-M” is not in use, or marked with 0x100 otherwise.
In other words, when a chimeric alignment is found, the best alignment (the top hit) uses soft clipping and the remaining hits use hard clipping.
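To make the flag bits in the BWA note concrete, here is a minimal Python sketch that classifies a SAM FLAG value by the two bits mentioned above. In the SAM format 0x100 marks a secondary alignment and 0x800 a supplementary alignment; the function name classify_hit and the returned labels are made up for illustration.

```python
SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment


def classify_hit(flag: int) -> str:
    """Label a record by the two FLAG bits discussed in the BWA note.

    classify_hit is a hypothetical helper, not part of BWA or any library.
    """
    if flag & SUPPLEMENTARY:
        return "supplementary (0x800)"
    if flag & SECONDARY:
        return "secondary (0x100)"
    # Neither bit set: the paired or top hit, which per the BWA note
    # is the one reported with soft clipping.
    return "top hit (neither bit set)"


if __name__ == "__main__":
    for flag in (0, 16, 256, 2048, 2064):
        print(flag, classify_hit(flag))
```

With BWA run without -M, the extra pieces of a chimeric read would be expected to fall in the 0x800 branch; with -M, in the 0x100 branch.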
With a hard clip, the clipped bases do not appear in the read's SEQ field in the SAM file (clipped sequences not present in SEQ); with a soft clip, they do (clipped sequences present in SEQ).
Interpretation 1:
Hard masked bases do not appear in the SEQ string, soft masked bases do.
So, if your cigar is: 10H10M10H
then the SEQ will only be 10 bases long.
if your cigar is 10S10M10S then the SEQ and base-quals will be 30 bases long.
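First, the two are displayed differently. With 10H10M10H, the sequence in column 10 (SEQ) is only 10 bp long; with 10S10M10S, all 30 bp are shown, even though the 20 bp at the start and end (10 bp on each side) did not align.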
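The rule for how long SEQ is can be checked with a small Python sketch. It assumes the standard SAM CIGAR operations, where M, I, S, = and X consume bases of the read and H does not appear in SEQ; the helper names parse_cigar and seq_length are hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar)]


def seq_length(cigar):
    """Number of bases expected in SEQ (and QUAL) for this CIGAR.

    Soft-clipped bases (S) are stored in SEQ; hard-clipped bases (H) are not.
    """
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


if __name__ == "__main__":
    print(seq_length("10H10M10H"))  # 10
    print(seq_length("10S10M10S"))  # 30
```

The same function gives 22 for 4S9M1D6M3S, a CIGAR that would describe the clipped alignment example at the top of this post (4 clipped bases, 9 matches, 1 deleted reference base, 6 matches, 3 clipped bases).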
In the case of soft-masking, even though the SEQ is present, it is not used by variant callers and not displayed when you view your data in a viewer. In either case, masked bases should not be used in calculating coverage.
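With soft clipping, although the displayed sequence is longer than with hard clipping, the clipped bases are not considered when calling variants or when visualizing the alignment. In both cases, the masked bases are not counted when computing coverage.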
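As a rough illustration of why clipped bases do not contribute to coverage, the Python sketch below counts depth per reference position from (start, CIGAR) pairs. It is a simplification under stated assumptions: positions are 0-based here (SAM POS is 1-based), only M, = and X add depth, deletions and skipped regions (D, N) advance along the reference without adding depth, and the function name coverage is hypothetical. Real tools differ in how they treat deletions.

```python
import re
from collections import Counter


def coverage(alignments):
    """Per-position depth for an iterable of (start, cigar) pairs.

    start is the 0-based reference position of the first aligned base.
    """
    depth = Counter()
    for start, cigar in alignments:
        ref_pos = start
        for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
            n = int(n)
            if op in "M=X":
                # Aligned bases: add depth and advance along the reference.
                for pos in range(ref_pos, ref_pos + n):
                    depth[pos] += 1
                ref_pos += n
            elif op in "DN":
                # Deletion or skipped region: advance without adding depth.
                ref_pos += n
            # I, S, H, P do not consume the reference, so clipped
            # bases never add depth anywhere.
    return depth


if __name__ == "__main__":
    d = coverage([(100, "10S10M10S"), (100, "10H10M10H")])
    print(sorted(d.items()))
```

Both records cover exactly the same ten reference positions, whether the flanking bases were soft or hard clipped.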
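Interpretation 2:
When a single read aligns to different chromosomes (a chimeric read), the alignment is reported with hard clipping. In the example above, R1 aligned to both the viral genome and chr7 (the first 69 bp to the viral genome and the last 35 bp to chr7), while R2 aligned to chr7.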
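The way the two pieces of R1 tile the read can be sketched in Python. The CIGAR strings 69M35S and 69H35M below are hypothetical, chosen only to match the 69 bp and 35 bp split described here; they are not taken from the actual SAM records. The helper read_interval is also made up, and it assumes forward-strand alignments (for a reverse-strand record the interval would have to be flipped).

```python
import re


def read_interval(cigar):
    """Return (start, end) of the aligned part in read coordinates (0-based, end exclusive)."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    # Bases clipped off the front of the read, soft or hard.
    leading_clip = 0
    for n, op in ops:
        if op not in "SH":
            break
        leading_clip += n
    # Read bases actually used in the alignment.
    aligned = sum(n for n, op in ops if op in "MI=X")
    return leading_clip, leading_clip + aligned


if __name__ == "__main__":
    print(read_interval("69M35S"))  # hypothetical viral piece of R1
    print(read_interval("69H35M"))  # hypothetical chr7 piece of R1
```

Under these assumptions the first record covers read bases 0 to 69 and the second covers 69 to 104, so together they account for the whole 104 bp read.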