Broad Institute video notes: Introduction to Germline Variant Discovery

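These notes cover the fifth lecture in this video series. I took them partly in English and partly in Chinese. I watched the videos in scattered bits of spare time, so which language I used depended on my mood at the moment...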

Video link: here

What is the difference between germline and somatic variants? Germline variants are essentially all the variants you are born with, inherited from your parents: one half from your mother, the other half from your father. There are also some germline variants that are unique to each person, but only about 30 of those. In this workshop we focus on short variants: point mutations, plus insertions and deletions shorter than 50 bp (an arbitrary threshold that we set).
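To make the "short variant" definition concrete, here is a minimal sketch that classifies a REF/ALT pair by size. The function name, the category labels, and the default of 50 are my own illustration of the threshold mentioned above, not anything taken from GATK.

```python
def classify_variant(ref: str, alt: str, max_indel_len: int = 50) -> str:
    """Classify one REF/ALT allele pair (illustrative labels only)."""
    if len(ref) == 1 and len(alt) == 1:
        return "point mutation"
    size = abs(len(alt) - len(ref))
    if size == 0:
        # Same length but more than one base: a multi-base substitution.
        return "multi-base substitution"
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    # Indels below the threshold count as short variants; the rest are
    # outside the scope of this workshop.
    if size < max_indel_len:
        return f"short {kind}"
    return f"long {kind}"


print(classify_variant("A", "G"))
print(classify_variant("A", "ATTG"))
print(classify_variant("ACGT", "A"))
```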

This is the overall best-practices pipeline for germline variant discovery. We already went over the first column, data pre-processing. This talk gives an overview of variant calling, which is the second column. The third column is filtering your variants and refining the called genotypes.

The key player in germline variant discovery is one tool: HaplotypeCaller. HaplotypeCaller essentially follows four steps. The first step is identifying active regions. HaplotypeCaller looks across your genome and says, "I'm going to focus on the regions that show the most variation." Most of your genome matches the reference, so concentrating on the regions with variation makes the process more efficient. The second step is local realignment: it uses graph-based assembly to build haplotypes from the reads around complex regions of the genome. The third step takes each read and determines its likelihood against each of the haplotypes that were built. The fourth step is assigning the genotype for your calls.
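The last two steps (read likelihoods, then genotypes) can be sketched with a textbook diploid model. This is a simplification I am assuming for illustration, with no priors and only two alleles; it is not GATK's actual implementation, and the per-read likelihood values are made up.

```python
import math
from itertools import combinations_with_replacement


def genotype_likelihoods(read_likelihoods, alleles=("ref", "alt")):
    """Return log10 P(reads | genotype) for every diploid genotype.

    read_likelihoods: list of dicts mapping allele -> P(read | allele).
    Each read is assumed to come from either chromosome copy with
    probability 0.5, and reads are assumed independent.
    """
    result = {}
    for a1, a2 in combinations_with_replacement(alleles, 2):
        log_l = 0.0
        for per_read in read_likelihoods:
            log_l += math.log10(0.5 * per_read[a1] + 0.5 * per_read[a2])
        result[(a1, a2)] = log_l
    return result


def call_genotype(read_likelihoods):
    """Pick the genotype with the highest likelihood."""
    gls = genotype_likelihoods(read_likelihoods)
    return max(gls, key=gls.get)


# Hypothetical site: five reads support ref, five support alt.
reads = [{"ref": 0.99, "alt": 0.01}] * 5 + [{"ref": 0.01, "alt": 0.99}] * 5
print(call_genotype(reads))
```

With an even split of reads, the heterozygous genotype wins because every read is reasonably well explained by one of the two alleles.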

So why do we do joint analysis? A single genome on its own does not give you much information. For example, you find a variant in a sample and want to see whether it is associated with a disease. How do you know that a variant call has any biological consequence? Maybe the variant is present in one of the parents, who is healthy and shows no disease phenotype, or maybe most people in that population carry the variant. Adding family and population data helps you filter out all of those variants. That is essentially the idea behind joint analysis. You want to focus on the variants that are rare in a population, because those are the ones that might have something to do with a disease-causing phenotype.
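The "keep what is rare in the population" idea can be sketched in a few lines. The data layout (a dict of sample name to a set of variant keys), the sample names, and the frequency cutoff are all assumptions for illustration; real joint calling works on genotype likelihoods rather than simple presence or absence.

```python
from collections import Counter


def carrier_frequency(cohort):
    """Map each variant to the fraction of samples that carry it."""
    counts = Counter(v for variants in cohort.values() for v in variants)
    n_samples = len(cohort)
    return {v: c / n_samples for v, c in counts.items()}


def rare_in_cohort(sample, cohort, max_freq=0.25):
    """Return the variants of `sample` carried by at most max_freq of the cohort."""
    freq = carrier_frequency(cohort)
    return {v for v in cohort[sample] if freq[v] <= max_freq}


# Hypothetical cohort: variant keys are (chrom, pos, ref, alt).
common = ("chr1", 1000, "A", "G")
private = ("chr2", 5000, "C", "T")
cohort = {
    "proband": {common, private},
    "mother": {common},
    "father": {common},
    "unrelated": {common},
}
print(rare_in_cohort("proband", cohort))
```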

Previously, HaplotypeCaller and joint calling were run together. If you had 10 samples, HaplotypeCaller would build a graph for all 10 samples and then do joint calling on the entire dataset. As you increase the number of samples, the graph just gets bigger and bigger, which makes HaplotypeCaller very time-consuming. And say your first wave of sequencing gave you 10 samples, and then a second wave arrives and you need to add samples to the original data: you have to go through the whole HaplotypeCaller process all over again.

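As mentioned above, if you want to add a few more samples to your existing data, you have to rerun HaplotypeCaller on everything, which wastes a lot of time. So the speaker presented a newer approach: run HaplotypeCaller on each sample separately and produce one GVCF file per sample. If you later want to add samples, you only need to run HaplotypeCaller on the new ones, then merge the GVCF files and do the joint analysis.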
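Here is a rough sketch of what that workflow looks like as command lines, built in Python so nothing is actually executed. The tool names and flags (`-ERC GVCF`, `CombineGVCFs`, `GenotypeGVCFs`) are what I understand GATK4 to use; check them against the documentation for your installed version. All file and sample names are hypothetical.

```python
def per_sample_gvcf_cmd(reference, bam, sample):
    """HaplotypeCaller in GVCF mode for one sample."""
    return ["gatk", "HaplotypeCaller",
            "-R", reference,
            "-I", bam,
            "-O", f"{sample}.g.vcf.gz",
            "-ERC", "GVCF"]


def joint_calling_cmds(reference, gvcfs, cohort="cohort"):
    """Merge per-sample GVCFs, then jointly genotype the merged file."""
    combine = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    combine += ["-O", f"{cohort}.g.vcf.gz"]
    genotype = ["gatk", "GenotypeGVCFs",
                "-R", reference,
                "-V", f"{cohort}.g.vcf.gz",
                "-O", f"{cohort}.vcf.gz"]
    return [combine, genotype]


# Two samples were already processed; a third one arrives later.
existing = ["s1.g.vcf.gz", "s2.g.vcf.gz"]
new_cmd = per_sample_gvcf_cmd("ref.fasta", "s3.bam", "s3")
commands = [new_cmd] + joint_calling_cmds("ref.fasta", existing + ["s3.g.vcf.gz"])
for cmd in commands:
    print(" ".join(cmd))
```

The point is that only the new sample goes through HaplotypeCaller again; the existing GVCFs are reused as they are.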

In GVCF mode there are still four steps, and they are the same four steps described above.

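What is a GVCF file? It is a genomic VCF file. It has the same format as a VCF file except for one thing: a GVCF has information for all the positions in your genome. But why do we need that information? In the figure above, the yellow part is all the variant information in your sample and the blue part is the reference information. Suppose you have several samples in your dataset and you find an interesting variant in one of them, while the other samples all have good read coverage at that position: now you know the variant really is interesting. Or, if you see no variant at some position, you can tell whether there truly is none or whether the reads simply did not cover that region. So if your file includes the reference information, you can answer the questions just mentioned.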
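The "truly no variant vs. not covered" distinction can be sketched with a toy in-memory model of a GVCF. The `RefBlock` type, the depth cutoff of 10, and the status labels are assumptions of mine for illustration; a real GVCF stores reference blocks as records with an END position and fields such as a minimum depth.

```python
from typing import NamedTuple


class RefBlock(NamedTuple):
    """A run of positions with no variant call (toy version of a GVCF block)."""
    start: int
    end: int      # inclusive
    min_dp: int   # lowest read depth seen inside the block


def site_status(pos, variant_positions, ref_blocks, min_depth=10):
    """Say what a GVCF-like record set tells us about one position."""
    if pos in variant_positions:
        return "variant"
    for block in ref_blocks:
        if block.start <= pos <= block.end:
            if block.min_dp >= min_depth:
                return "confident reference"
            return "low coverage"
    return "no data"


# Hypothetical sample: one variant at 105, well covered before it, poorly after.
variants = {105}
blocks = [RefBlock(100, 104, 30), RefBlock(106, 120, 2)]
for p in (102, 105, 110, 200):
    print(p, site_status(p, variants, blocks))
```

A plain VCF would only contain position 105, so positions 102 and 110 would look identical even though only one of them is actually supported by reads.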

After the joint analysis, you need to filter your variants and refine your genotypes.

There are several ways to filter variants. The figure above was produced with one tool, VQSR. VQSR takes seven different annotations and uses a combination of them to decide which variants are good and which should pass the filter. In this figure, all the red points are good variants, the ones you want to keep. The green points are variants that are not especially bad, but not great either. If you tried to do this with only two annotations, you would put any two annotations on the X and Y axes; if you then tried to draw a box that encloses the good variants, there is no really good way to do it with just two annotations.

You can add population priors and family priors to improve your calls and refine the genotypes. In the example above, we found 417 de novo mutations in a single sample. That number is not plausible, because a person has only about 30 mutations that are unique to them, so 417 is clearly too many. After applying population priors and family priors, the number dropped to 17. Restricting to high-confidence de novo mutations leaves 8, a number that finally makes sense.
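For reference, the most naive way to flag a de novo candidate in a trio is to check whether the child carries an allele that neither parent has. The sketch below assumes genotypes are given as tuples of allele indices (0 = reference), which is my own simplification. It ignores genotype quality and priors entirely, and a check this crude would likely overcall in the same way as the unrefined count above.

```python
def is_candidate_de_novo(child, mother, father):
    """True if the child has a non-reference allele absent from both parents.

    Genotypes are tuples of allele indices, e.g. (0, 1) for heterozygous.
    """
    parental_alleles = set(mother) | set(father)
    child_alt_alleles = {a for a in child if a != 0}
    return any(a not in parental_alleles for a in child_alt_alleles)


# Hypothetical trio genotypes at three sites: (child, mother, father).
sites = {
    "site_a": ((0, 1), (0, 0), (0, 0)),  # child het, parents hom-ref
    "site_b": ((0, 1), (0, 1), (0, 0)),  # inherited from the mother
    "site_c": ((1, 1), (0, 1), (0, 1)),  # both copies inherited
}
for name, (child, mother, father) in sites.items():
    print(name, is_candidate_de_novo(child, mother, father))
```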

Once you have a call set, the last thing you want to do is evaluate it. How good were my calls? How sensitive were they? How specific? What is the ratio of true positives to false positives or false negatives? To answer this you need a truth dataset, against which you run a concordance analysis.
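A concordance check boils down to set arithmetic once each variant has a key. This is a minimal sketch, assuming variants are identified by (chrom, pos, ref, alt) tuples and that exact matching is good enough; the example data is made up. I report precision rather than specificity, since specificity would also need a count of true negatives.

```python
def concordance(calls, truth):
    """Compare a call set against a truth set of variant keys."""
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Hypothetical truth set of four variants, and a call set of three.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr1", 300, "G", "GA"),
    ("chr1", 400, "TC", "T"),
}
calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr1", 999, "A", "C"),
}
print(concordance(calls, truth))
```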