刘小泽写于2020-3-9
不涉及软件的具体使用方法,重点在于结果的比较
从一个小问题入手不断探索,也是一种不错的学习方式
需要先进行预处理步骤——得到BQSR bam
下面的流程就简单带过了,只使用了标准流程,详细流程可以跟着https://www.yuque.com/biotrainee/wes/
走一遍
gatk调用的mutect2
其中的-L
参数指定了:输出exon上的结果
# 以下的变量都需要自己提前定义好,比如:
# $REF=/home/reference/mm10.fa
# Target the analysis to specific genomic intervals with the -L parameter
gatk --java-options "-Xmx20G -Djava.io.tmpdir=${TMP_DIR}" Mutect2 \
-R $REF \
-I ${ALIGN_DIR}/${N}_bqsr.bam -normal $N \
-I ${ALIGN_DIR}/${T}_bqsr.bam -tumor $T \
-L $BED \
-O ${RESULT_DIR}/${name}_mutect2.vcf
gatk --java-options "-Xmx20G -Djava.io.tmpdir=${TMP_DIR}" FilterMutectCalls \
-R $REF \
-V ${RESULT_DIR}/${name}_mutect2.vcf \
-O ${RESULT_DIR}/${name}_somatic.vcf
cat ${RESULT_DIR}/${name}_somatic.vcf | perl -alne '{if(/^#/){print}else{next unless $F[6] eq "PASS";next if $F[0] =~/_/;print } }' > \
${RESULT_DIR}/${name}_filter.vcf
然后是muse
# 需要先搞一个样本的配置文件,例如
# cat >config
# normal1 tumor1
# normal2 tumor2
#...
cat config | while read i;do
conf=($i)
name=${conf[1]}
N=${conf[0]}
T=${conf[1]}
# for somatic calling
cd $RESULT_DIR
muse call -O $name -f $REF_DIR \
$ALIGN_DIR/${T}_bqsr.bam \
$ALIGN_DIR/${N}_bqsr.bam
muse sump -I ${name}.MuSE.txt -E \
–O ${name}_muse.vcf –D $DBSNP
done
最后是varscan
cat config | while read i;do
conf=($i)
name=${conf[1]}
N=${conf[0]}
T=${conf[1]}
# for somatic calling
# java -jar VarScan.jar somatic normal.pileup tumor.pileup output.basename
cd $RESULT_DIR
normal_pileup="samtools mpileup -q 1 -f $REF_DIR $ALIGN_DIR/${N}_bqsr.bam"
tumor_pileup="samtools mpileup -q 1 -f $REF_DIR $ALIGN_DIR/${T}_bqsr.bam"
varscan somatic <($normal_pileup) <($tumor_pileup) $name
varscan processSomatic ${name}.snp
# 重点关注:${name}.snp.Somatic.hc (.hc means High Confidence)
done
利用python脚本将varscan结果转为vcf
# script: https://github.com/student-t/Varscan2VCF/blob/master/vscan2vcf.py
python varscan2vcf.py lung-1.snp.Somatic.hc >lung-1_varscan.hc.snp.vcf
三个软件结果的统计
1 首先:统计所有原始的vcf文件variant数量
ls *vcf | while read i;do( count=`cat $i | grep -v "#" | wc -l`; name=`basename $i`;echo -e $count"\t"$name);done
# 4631 lung-1_gatk.filter.vcf
# 23875 lung-1_muse.vcf
# 29917 lung-1_varscan.hc.snp.vcf
2 然后找落在exon上的snv
首先需要exon.bed文件
BED=$wkd/reference/mm10.exon.bed
# 关于得到exon.bed
cat CCDS.current.txt | grep "Public" | perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{if($3>$2) print "chr"$0"\t0\t+"}' > mm10.exon.bed
然后用SnpSift的intervals功能
# for muse
ls *muse.vcf | while read i;do (name=${i%%.*}.exon.vcf; cat $i | java -jar ~/biosoft/snpEff/SnpSift.jar intervals $BED >$name);done
# for varscan
ls *varscan.hc.snp.vcf | while read i;do (name=${i%%.*}.hc.snp.exon.vcf; cat $i | java -jar ~/biosoft/snpEff/SnpSift.jar intervals $BED >$name);done
# for gatk
ls *gatk.filter.vcf | while read i;do (name=${i%%.*}.filter.exon.vcf; cat $i | java -jar ~/biosoft/snpEff/SnpSift.jar intervals $BED >$name);done
3 然后添加dbSNP信息
使用SnpSift的annotate功能【需要准备好dbSNP的vcf】
ls *exon.vcf | while read i;do (name=${i%%.*}.exon.dbsnp.vcf;java -jar ~/biosoft/snpEff/SnpSift.jar annotate /home/yunzeliu/project/2020-01-20-mm10_wes/reference/snp_vcf/00-All.vcf $i >$name);done
对结果再进行一个统计
ls *.exon.dbsnp.vcf | while read i;do( count=`cat $i | grep -v "#" | wc -l`; name=`basename $i`;echo -e $count"\t"$name);done
# 4631 lung-1_gatk.exon.dbsnp.vcf
# 5961 lung-1_muse.exon.dbsnp.vcf
# 6840 lung-1_varscan.exon.dbsnp.vcf
4 看一下未知突变的情况
外显子测序一般会关注:癌有癌旁没有、数据库中未记录的、非同义突变
ls *.exon.dbsnp.vcf | while read i;do( count=`cat $i|perl -alne '{print if $F[2] eq "."}' | wc -l`; name=`basename $i`;echo -e $count"\t"$name);done
# 389 lung-1_gatk.exon.dbsnp.vcf
# 508 lung-1_muse.exon.dbsnp.vcf
# 556 lung-1_varscan.exon.dbsnp.vcf
三个软件结果的比较
先用bedops --intersect
统计
它是根据bed格式去统计
bedops --intersect <(vcf2bed < lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | wc -l
# 4203
然后用bcftools isec
统计
如果使用bcftools,需要用vcf.gz格式,并进行index
# for gatk 【$F[3]指的是Ref碱基】
bcftools view -v snps lung-1_gatk.exon.dbsnp.vcf| perl -alne 'if (/^#/) {print} elsif (length($F[3]) == 1) {print}' | bgzip > lung-1_gatk.exon.dbsnp.vcf.gz
tabix -p vcf lung-1_gatk.exon.dbsnp.vcf.gz
# for varscan
bcftools view -v snps lung-1_varscan.exon.dbsnp.vcf| perl -alne 'if (/^#/) {print} elsif (length($F[3]) == 1) {print}' | bgzip > lung-1_varscan.exon.dbsnp.vcf.gz
tabix -p vcf lung-1_varscan.exon.dbsnp.vcf.gz
# for musr
bcftools view -v snps lung-1_muse.exon.dbsnp.vcf| perl -alne 'if (/^#/) {print} elsif (length($F[3]) == 1) {print}' | bgzip > lung-1_muse.exon.dbsnp.vcf.gz
tabix -p vcf lung-1_muse.exon.dbsnp.vcf.gz
然后
bcftools isec -n=3 lung-1_gatk.exon.dbsnp.vcf.gz lung-1_varscan.exon.dbsnp.vcf.gz lung-1_muse.exon.dbsnp.vcf.gz -p isec
cat ./isec/0001.vcf | grep -v "#" | wc -l
# 4176
结果输出在isec
目录中
对比两个结果
bedops
结果是:4203
bcftools
结果是:4176
看一下二者差异
# 这行代码的意思是:找出来file2(bedops运行结果)相对于file1(bcftools isec运行结果)不同的值
awk 'NR==FNR{a[$0]}NR>FNR{ if(!($1 in a)) print $0}' <(cat ./isec/0001.vcf | grep -v "#" | cut -f2) <(bedops --intersect <(vcf2bed < lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | cut -f3)
# 总共28个...
# 3978708
# 76066627
# 41103551
# 112930880
# 114862377
# 114865432
# 37335619
# 37766922
# 11601930
# 11703882
以位点3978708为例:
# 在bedops结果中检查
bedops --intersect <(vcf2bed < lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | grep 3978708
# chr10 3978707 3978708
# 在bcftools结果中检查
cat ./isec/0001.vcf | grep -v "#" | grep 3978708
# 无
继续往上推,isec是由三个软件的vcf得到的,那么看看三个文件中是否存在这个位点
# for gatk
grep 3978708 lung-1_gatk.exon.dbsnp.vcf
# chr10 3978708 . TG CA . PASS CONTQ=93;DP=306;ECNT=1;GERMQ=93;MBQ=20,20;MFRL=190,181;MMQ=60,60;MPOS=51;NALOD=2.08;NLOD=35.22;POPAF=6.00;SEQQ=93;STRANDQ=93;TLOD=239.95 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:57,60:0.507:117:27,28:26,31:28,29,25,35 0/0:177,0:8.228e-03:177:71,0:93,0:77,100,0,0
# for varscan
grep 3978708 lung-1_varscan.exon.dbsnp.vcf
# chr10 3978708 rs29320259 T C . PASS AF=0.5139;DP=188;SOMATIC;SS=2;SSC=187;GPV=1.0;SPV=1.8965275762250198E-19;GENEINFO=270685:Mthfd1l;RSPOS=3978708;SAO=0;VC=snp;VP=050028000A05030100000100;dbSNPBuildID=125
# for muse
grep 3978708 lung-1_muse.exon.dbsnp.vcf
# chr10 3978708 rs29320259 T C . PASS SOMATIC;GENEINFO=270685:Mthfd1l;RSPOS=3978708;SAO=0;VC=snp;VP=050028000A05030100000100;dbSNPBuildID=125 GT:DP:AD:BQ:SS 0/1:120:57,62:29,32:2 0/0:183:182,0:29,0:.
看到三个软件都能得到这个位点的结果,但是gatk得到的是:TG => CA
,其他两个软件得到的是:T => C
在IGV中验证这个位点
# tumor
samtools view -h $wkd/analysis/3-align/lung-1_bqsr.bam \
chr10:3978600-3978800 | samtools view -Sb - > lung-1.test.bam
samtools index lung-1.test.bam
# normal
samtools view -h $wkd/analysis/3-align/normal-lung-2_bqsr.bam \
chr10:3978600-3978800 | samtools view -Sb - > normal-lung-2.test.bam
samtools index normal-lung-2.test.bam
发现两个SNV在一起(红T、褐G、绿A、蓝C),所以GATK将它认为是:TG 变 CA
很明显,gatk把3978708和3978709这两个SNV放在了一起,如果单独看3978709也是没有结果的;但另外两个软件把这两个分开展示了:
# 两个位点分开写了
grep 3978708 lung-1_varscan.exon.dbsnp.vcf
# chr10 3978708 rs29320259 T C . PASS AF=0.5139;DP=188;SOMATIC;SS=2;SSC=187;GPV=1.0;SPV=1.8965275762250198E-19;GENEINFO=270685:Mthfd1l;RSPOS=3978708;SAO=0;VC=snp;VP=050028000A05030100000100;dbSNPBuildID=125
grep 3978709 lung-1_varscan.exon.dbsnp.vcf
# chr10 3978709 rs29362746 G A . PASS AF=0.5068;DP=189;SOMATIC;SS=2;SSC=185;GPV=1.0;SPV=3.092867428835387E-19;GENEINFO=270685:Mthfd1l;RSPOS=3978709;SAO=0;VC=snp;VP=050000000305030100000100;dbSNPBuildID=125
所以,这样造成的一个影响就是:
用之前的办法寻找GATK、varscan、muse三个的交集,这个位点3978708
和下个位点3978709
都是不存在的
原因是什么?
之前用bcftools的时候,指定了一个参数:-v
,其中的几个选项是:
snps, indels, mnps, others
我们这里只想看snps,于是都指定了-v snps
但是gatk把相邻的两个snv当成了mnps
MNP (multi-nucleotide polymorphism, e.g. a dinucleotide substitution)
两个相邻的SNP会成为MNP一搜索就有结果:
# for gatk snps
bcftools view -v snps lung-1_gatk.exon.dbsnp.vcf | grep 3978708
# 无结果
# for gatk mnps
bcftools view -v mnps lung-1_gatk.exon.dbsnp.vcf | grep 3978708
# chr10 3978708 . TG CA . PASS CONTQ=93;DP=306;ECNT=1;GERMQ=93;MBQ=20,20;MFRL=190,181;MMQ=60,60;MPOS=51;NALOD=2.08;NLOD=35.22;POPAF=6;SEQQ=93;STRANDQ=93;TLOD=239.95 GT:AD:AF:DP:F1R2:F2R1:SB 0/1:57,60:0.507:117:27,28:26,31:28,29,25,35 0/0:177,0:0.008228:177:71,0:93,0:77,100,0,0
而GATK结果中像这样的MNPs有多少个呢?
bcftools view -v mnps lung-1_gatk.exon.dbsnp.vcf | grep -v "#" |wc -l
# 31个
如何将MNP变成SNP显示?
https://www.biostars.org/p/280202/
conda install -y vt
bcftools norm -m -both lung-1_gatk.exon.dbsnp.vcf | vt decompose_blocksub - -o new-lung-1_gatk.exon.dbsnp.vcf
之后再次查看位点chr10:3978708
就拆分成了两个
grep 3978708 new-lung-1_gatk.exon.dbsnp.vcf
# chr10 3978708 . T C . PASS CONTQ=93;DP=306;ECNT=1;GERMQ=93;MBQ=20,20;MFRL=190,181;MMQ=60,60;MPOS=51;NALOD=2.08;NLOD=35.22;POPAF=6;SEQQ=93;STRANDQ=93;TLOD=239.95;OLD_CLUMPED=chr10:3978708:TG/CA GT:AD:AF:DP:F1R2:F2R1:SB 0/1:57,60:0.507:117:27,28:26,31:28,29,25,35 0/0:177,0:0.008228:177:71,0:93,0:77,100,0,0
# chr10 3978709 . G A . PASS CONTQ=93;DP=306;ECNT=1;GERMQ=93;MBQ=20,20;MFRL=190,181;MMQ=60,60;MPOS=51;NALOD=2.08;NLOD=35.22;POPAF=6;SEQQ=93;STRANDQ=93;TLOD=239.95;OLD_CLUMPED=chr10:3978708:TG/CA GT:AD:AF:DP:F1R2:F2R1:SB 0/1:57,60:0.507:117:27,28:26,31:28,29,25,35 0/0:177,0:0.008228:177:71,0:93,0:77,100,0,0
再次对比两个结果
awk 'NR==FNR{a[$0]}NR>FNR{ if(!($1 in a)) print $0}' <(bedops --intersect <(vcf2bed < new-lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | cut -f3) <(cat ./isec/0001.vcf | grep -v "#" | cut -f2) | wc -l
# 这次bcftools的结果反超27个
cat ./isec/0001.vcf | grep -v "#" | wc -l
# 4229(之前是4176)
# bedops结果是:4203
按说这次结果应该一样才对,再次检查一个位点(chr2:142256608
):
awk 'NR==FNR{a[$0]}NR>FNR{ if(!($1 in a)) print $0}' <(bedops --intersect <(vcf2bed < new-lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | cut -f3) <(cat ./isec/0001.vcf | grep -v "#" | cut -f2) |head
# 142256608
# 92243077
# 102477559
# 102614562
# 102973071
# 104893180
# 105035217
# 106891444
# 108755519
# 109143523
如果看看bedops的结果就能明白了:
bedops --intersect <(vcf2bed < new-lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | grep 142256608
# 无结果
bedops --intersect <(vcf2bed < new-lung-1_gatk.exon.dbsnp.vcf) <(vcf2bed < lung-1_varscan.exon.dbsnp.vcf) <(vcf2bed < lung-1_muse.exon.dbsnp.vcf) | grep 1422566089
# chr2 142256607 142256609
因为bedops
的结果是bed格式,它把相邻的位点(142256608和142256609)合并在了一起,所以这个结果少的那27个,就是来自之前GATK检测的31个MNPs
既然之前GATK检测了31个MNPs,那么为何这里结果之差27呢?
这是因为我们求的是交集,虽然GATK能检测到,但另外两个可能检测不到啊
例如:
grep 129697407 new-lung-1_gatk.exon.dbsnp.vcf
# chr6 129697407 . A C . PASS CONTQ=93;DP=930;ECNT=2;GERMQ=93;MBQ=20,33;MFRL=196,165;MMQ=60,33;MPOS=2;NALOD=2.39;NLOD=72.24;POPAF=6;SEQQ=93;STRANDQ=12;TLOD=29.33;OLD_CLUMPED=chr6:129697407:AG/CA GT:AD:AF:DP:F1R2:F2R1:SB 0/1:524,16:0.036:540:285,7:230,8:216,308,3,13 0/0:362,0:0.004103:362:197,0:154,0:143,219,0,0
# chr6 129697408 . G A . PASS CONTQ=93;DP=930;ECNT=2;GERMQ=93;MBQ=20,33;MFRL=196,165;MMQ=60,33;MPOS=2;NALOD=2.39;NLOD=72.24;POPAF=6;SEQQ=93;STRANDQ=12;TLOD=29.33;OLD_CLUMPED=chr6:129697407:AG/CA GT:AD:AF:DP:F1R2:F2R1:SB 0/1:524,16:0.036:540:285,7:230,8:216,308,3,13 0/0:362,0:0.004103:362:197,0:154,0:143,219,0,0
grep 129697407 lung-1_muse.exon.dbsnp.vcf
# 无结果
grep 129697407 lung-1_varscan.exon.dbsnp.vcf
# 无结果
因此即使将GATK的MNP变成SNP,但依然不会被bcftools的isec
结果收录
一般来说,最后需要做个韦恩图来对比一下软件【当然这个图是之前的版本】,之后就可以挑取交集继续向下分析
欢迎关注我们的公众号~_~
我们是两个农转生信的小硕,打造生信星球,想让它成为一个不拽术语、通俗易懂的生信知识平台。需要帮助或提出意见请后台留言或发送邮件到jieandze1314@gmail.com