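GWAS Analysis in Clinical Bioinformatics

Today I want to share Chapter 5, Bioinformatics Challenges in Genome-Wide Association Studies (GWAS), from the collection Clinical Bioinformatics, a volume of laboratory protocols for clinical bioinformatics.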

De R., Bush W.S., Moore J.H. (2014) Bioinformatics Challenges in Genome-Wide Association Studies (GWAS). In: Trent R. (eds) Clinical Bioinformatics. Methods in Molecular Biology (Methods and Protocols), vol 1168. Humana Press, New York, NY

http://www.springer.com/series/7651

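A one-page mind map summary

Download link: GWAS principles, by rapunzel

One of the authors, Professor Jason H. Moore, works at the Geisel School of Medicine at Dartmouth. His research covers biostatistics, epidemiology, and genomics. He developed the SPARCoC software and also wrote a book, Computational Methods for Genetics of Complex Traits (2010), which I will track down once I can afford it.

It really is expensive.

OK, back to the chapter.

Abstract: this chapter reviews the basic concepts of GWAS, the technologies used to capture genetic variation, the missing heritability problem, efficient study design, how to reduce the bias introduced into a dataset, and how to make use of new resources such as electronic medical records.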

Key words:Data imputation, Epistasis, Electronic medical records, Filtering, Gene–gene interactions, GWAS, Meta-analysis, Missing heritability, Replication

I. Introduction

GWAS is based on the Common Disease-Common Variant (CD-CV) hypothesis: common diseases (such as type II diabetes, rheumatoid arthritis, or essential hypertension) are caused in part by genetic variations that are also common in the population.

The relationship between SNP effect size and disease heritability: if common variants have a small effect size but common diseases show a strong inheritance in families (high heritability), then almost by definition the disease must be influenced by multiple genetic factors.

The missing heritability problem: GWAS has had limited success in detecting genetic variants that account for a large portion of the heritability of any common disease trait. As an example, the authors note that two loci found in breast cancer studies explain only 5.9% of the familial risk of breast cancer.

    *One cause is epistasis (epistatic interactions). Biological epistasis refers to the physical interactions between biomolecules that are influenced by multiple genetic variants. Statistical epistasis is the term for the nonadditive interactions between multiple genes, each of which affects disease susceptibility, and the environment.

    *Solutions: 1) Designing our studies to search for nonlinear interactions amongst SNPs. 2) Using methods such as meta-analysis and data imputation to increase our statistical power. 3) Establishing strict criteria for defining phenotypes.

II. Materials

This section introduces the two genotyping platforms, Illumina and Affymetrix, and the use of Electronic Medical Records. I skip it here.

III. Methods

Overview of the GWAS process                                            

1 Basic concepts:

SNP: single base pair changes in the DNA sequence, which have now become the modern unit of genetic variation

MAF: the frequency of the less common allele is referred to as the minor allele frequency

LD: linkage disequilibrium is a measure of correlation between SNP alleles at one site and the specific alleles carried at variant sites nearby. It is measured with D′ or r²

Haplotype: a particular combination of alleles along a chromosome

tag SNPs: SNPs in strong LD with the other variants surrounding them; these are the ones that end up being selected
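As a rough illustration of the MAF and LD definitions above, here is a minimal sketch that computes MAF from allele counts and D, D′, and r² from the four haplotype counts of two biallelic SNPs. The function names and the example counts are my own assumptions and do not come from the chapter; real tools such as PLINK estimate LD from unphased genotypes, which this sketch does not attempt.

```python
def minor_allele_frequency(n_allele_1, n_allele_2):
    """MAF from the counts of the two alleles at one biallelic SNP."""
    return min(n_allele_1, n_allele_2) / (n_allele_1 + n_allele_2)


def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r2) from phased haplotype counts.

    Haplotypes are labelled by the allele at SNP 1 (A/a) and SNP 2 (B/b).
    D_prime is reported as an absolute value between 0 and 1.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n   # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / n   # frequency of allele B at SNP 2

    d = p_AB - p_A * p_B      # raw disequilibrium coefficient

    # Largest |D| possible given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical haplotype counts, for illustration only.
d, d_prime, r2 = ld_stats(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
```

With these made-up counts, D′ comes out at 0.6 and r² at 0.36. Two SNPs that always travel together (for example 50 AB and 50 ab haplotypes) give D′ = r² = 1, which is the situation a tag SNP exploits.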

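2 Study design:

(1)Case–Control vs. Quantitative

Case-control studies usually have a binary outcome, such as case/control or affected/unaffected. If a SNP allele is more frequent in cases than in controls, the SNP is associated with increased disease risk. Quantitative studies assess quantitative or continuous traits that are measured as numerical values (such as HDL or LDL) and ask whether SNP or allele frequencies are associated with the trait.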

(2)Standardizing Phenotype Criteria

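A standardized definition of the phenotype is very important, especially in multi-institution collaborations. Misclassifying a patient from case to control in a case-control study can sometimes do far more damage than recording a wrong value in a quantitative study.

(3)Testing for an Association (key section)

    1)Preparation

       Choose an appropriate method: association tests relate either alleles (allelic) or genotypes (genotypic) to the phenotype, and depending on the situation a dominant, recessive, or additive model is chosen for the analysis

       Adjust the dataset: use regression to adjust for covariates so that false positives are avoided

      Population substructure: one of the important covariates. Ethnic-specific SNPs may show up to be associated with a trait due to population stratification. It can be analyzed with STRUCTURE or EIGENSTRAT

    2)Single-locus vs. multi-locus

For binary traits in case-control studies, a contingency table method or logistic regression is commonly used.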
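To make the dominant, recessive, and additive models mentioned above concrete, the following sketch shows one common way of turning a genotype into a numeric predictor under each model. It assumes genotypes are stored as the number of minor alleles (0, 1, or 2); the function name `encode_genotype` is hypothetical and not taken from the chapter or from any package.

```python
def encode_genotype(minor_allele_count, model):
    """Numeric coding of one genotype under a given inheritance model.

    minor_allele_count: 0, 1 or 2 copies of the minor allele.
    model: "additive", "dominant" or "recessive".
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 minor alleles")
    if model == "additive":
        # Each extra copy of the minor allele shifts risk by the same amount.
        return minor_allele_count
    if model == "dominant":
        # One copy is enough to show the effect.
        return 1 if minor_allele_count >= 1 else 0
    if model == "recessive":
        # Two copies are needed to show the effect.
        return 1 if minor_allele_count == 2 else 0
    raise ValueError("unknown model: " + model)


# Coding of the three genotypes (0, 1, 2 minor alleles) under each model.
codings = {
    model: [encode_genotype(g, model) for g in (0, 1, 2)]
    for model in ("additive", "dominant", "recessive")
}
```

The encoded value is what would then enter a regression alongside covariates; choosing the wrong model mainly costs statistical power, which is one reason the additive coding is often used as a default.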

       *A contingency table summarizes the number of individuals within each genotypic group for a single biallelic SNP. It searches for a deviation from the null hypothesis that there is no association between the phenotype and genotype, e.g. with the chi-square test or Fisher's exact test, run in SAS, SPSS, Stata, or Microsoft Excel.

        *Logistic regression is an extension of linear regression where the phenotypic outcome studied is transformed using a logistic function. This method predicts the probability of an individual having a case status, given their genotype class. It is more widely used because it allows adjustment for covariates.
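The contingency table approach above can be sketched with a 2×2 table of allele counts (minor vs. major allele, cases vs. controls). This is only an illustration under my own assumptions: the counts are invented, the function name is hypothetical, and the p-value uses the closed form for a chi-square distribution with 1 degree of freedom rather than a statistics package.

```python
import math


def allelic_chi_square(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-square test (1 df) on a 2x2 allele-count table.

    Returns (chi2, p_value, odds_ratio).
    """
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if phenotype and allele were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_minor * ctrl_major) / (case_major * ctrl_minor)
    return chi2, p_value, odds_ratio


# Invented allele counts: 200 cases and 200 controls, two alleles each.
chi2, p_value, odds_ratio = allelic_chi_square(120, 280, 80, 320)
```

For these invented counts the statistic is 32/3 (about 10.67) and the odds ratio is 12/7 (about 1.71). A test like this cannot adjust for covariates, which is the reason given above for preferring logistic regression in practice.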

For quantitative traits, Analysis of Variance (ANOVA) is commonly used. It assumes that 1) the trait is normally distributed, 2) the variance of the trait is the same within each group, and 3) the groups are independent. For single-SNP analysis, ANOVA tests the null hypothesis that the trait mean is the same in every genotype group.

PLINK is a commonly used software package for GWAS analysis, powerful and easy to use. Association can be tested with the allelic or inheritance models, or by using the Cochran-Armitage test (a contingency table method).
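As a small worked example of ANOVA for a quantitative trait, the sketch below computes the one-way F statistic for a trait grouped by genotype at a single SNP. The trait values are invented and the function name is hypothetical. The p-value is left out on purpose, since it needs the F distribution, which the Python standard library does not provide.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of trait values."""
    k = len(groups)                          # number of genotype groups
    n = sum(len(g) for g in groups)          # total number of individuals
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of individuals around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented trait values for genotypes AA, Aa and aa.
trait_by_genotype = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_between, df_within = one_way_anova_f(trait_by_genotype)
```

Here F = 3.0 with (2, 6) degrees of freedom. A large F means the genotype group means differ more than the within-group scatter would explain, which is evidence against the null hypothesis.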
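The Cochran-Armitage trend test mentioned above can be written out directly from genotype counts. The sketch below uses the standard textbook formula with additive weights (0, 1, 2); it is not PLINK's code, and the function name and the counts are assumptions made for illustration.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on a 2x3 genotype table.

    cases, controls: counts per genotype, ordered by minor-allele count.
    Returns (chi2, p_value), with chi2 referred to 1 degree of freedom.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases = sum(cases)
    n_controls = sum(controls)
    n = n_cases + n_controls

    sum_t_cases = sum(t * r for t, r in zip(weights, cases))
    sum_t_total = sum(t * c for t, c in zip(weights, totals))
    sum_t2_total = sum(t * t * c for t, c in zip(weights, totals))

    # Trend statistic: T = N * sum(t_i * r_i) - R * sum(t_i * n_i)
    trend = n * sum_t_cases - n_cases * sum_t_total
    # chi2 = N * T^2 / (R * S * (N * sum(t_i^2 n_i) - (sum(t_i n_i))^2))
    denominator = n_cases * n_controls * (n * sum_t2_total - sum_t_total ** 2)
    chi2 = n * trend ** 2 / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square upper tail, 1 df
    return chi2, p_value


# Invented genotype counts (0, 1, 2 minor alleles) in 100 cases and 100 controls.
chi2, p_value = cochran_armitage_trend(cases=(40, 40, 20), controls=(60, 30, 10))
```

With these invented counts the statistic is 1800/211, roughly 8.53. Unlike the allelic test, the trend test works on genotype counts and does not assume Hardy-Weinberg equilibrium, which is a commonly cited reason for using it.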


Because analyzing one SNP at a time within a linear modeling framework contributes to the missing heritability problem mentioned earlier, multi-locus analysis is needed: more holistic approaches that recognize the complex landscape of the genotype–phenotype relationship and examine nonlinear interactions between genetic variants throughout the genome. The biggest challenge here is that processing 500,000 SNPs consumes a great deal of computing resources, so specific filtering methods are needed to ease the computational burden.

A typical single-SNP GWAS analysis first filters on MAF and LD values (which still leaves about 300,000 SNPs), and then applies a significance threshold to pick out main-effect markers (single SNPs strongly associated with the disease).

Another filtering approach is to check whether markers interact within a given pathway or protein family: the dataset can also be filtered so that only those multi-marker interactions will be examined that fit within a certain biological context such as a biological pathway, protein family, and group of genes or proteins involved in a certain molecular function.

The Biofilter algorithm combines biomedical knowledge from multiple public repositories with statistical methods such as logistic regression or the multifactor dimensionality reduction (MDR) method to analyze SNP–SNP combinations.

    3)Post-analysis correction

The p-value is defined as the probability of observing a test statistic that is equal to or greater than the observed test statistic, if the null hypothesis is true. (See also: the problem with p-values.)

Commonly used multiple-testing correction methods in GWAS include:
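A quick calculation shows why exhaustive multi-locus analysis is so expensive, using the SNP counts quoted above (500,000 before filtering, about 300,000 after). The helper name is my own; the arithmetic is just the binomial coefficient.

```python
import math


def n_snp_combinations(n_snps, order=2):
    """Number of distinct SNP combinations of a given order (pairs by default)."""
    return math.comb(n_snps, order)


pairs_before_filtering = n_snp_combinations(500_000)   # 124,999,750,000 pairs
pairs_after_filtering = n_snp_combinations(300_000)    # 44,999,850,000 pairs
```

Even after MAF and LD filtering there are still about 45 billion pairwise tests, and three-way combinations grow far faster, which is why knowledge-based filters that restrict the search to biologically plausible combinations are attractive.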

    *The Bonferroni correction

    *Adjusting the False Discovery Rate (FDR)

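     *Using permutation testing to adjust the significance threshold, with tools such as PLINK, PRESTO, and PERMORY

(4)Reproducibility of results

The sole purpose of replication is to evaluate the initial positive GWAS findings and confirm their validity and credibility.

    1)Statistical Replication

Statistical replication requires the following conditions:

    *A sufficiently large sample size. This is crucial because of the winner's curse (the effect estimated in the GWAS study population is overestimated, i.e. higher than the true effect in the population)

    *Replication must be carried out in an independent dataset from the same population, and the same criteria should be used to define the phenotype in question

    *Because GWAS markers are selected based on LD patterns, replication should target a genomic region rather than necessarily the specific SNP found in the initial study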
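The first two corrections listed above are simple enough to sketch in a few lines. The function names and the p-values are invented for illustration; the Benjamini-Hochberg step-up procedure is used here as one common way of controlling the FDR, which is my choice and not necessarily the variant the chapter has in mind. Permutation testing is left out because it needs the raw genotype data.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests declared significant by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            cutoff_rank = rank

    # Every test ranked at or below the cutoff is declared significant.
    return sorted(order[:cutoff_rank])


# Invented p-values for ten SNPs.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

threshold = bonferroni_threshold(0.05, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= threshold]
fdr_hits = benjamini_hochberg(p_values, alpha=0.05)
```

On these ten p-values Bonferroni keeps only the first SNP while Benjamini-Hochberg keeps the first two, which illustrates that FDR control is less conservative. For scale, a Bonferroni correction of 0.05 over one million tests gives the familiar 5 × 10⁻⁸ threshold.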

    2)Meta-analysis

Meta-analysis is a statistical method for combining several different studies to provide one summary result. It aims to examine the effect of the same allele across all studies (provided that all the studies test the same hypothesis). Heterogeneity can be quantified with Cochran's Q or the I² statistic.
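To show how the summary result and the heterogeneity statistics relate, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis together with Cochran's Q and I². The fixed-effect model is an assumption on my part (the chapter summary does not say which model is used), and the effect sizes and standard errors are invented.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance fixed-effect meta-analysis of one allele's effect.

    betas: per-study effect estimates (e.g. log odds ratios) for the same allele.
    standard_errors: their standard errors.
    Returns (pooled_beta, pooled_se, cochran_q, i_squared_percent).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    weight_sum = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    pooled_se = math.sqrt(1.0 / weight_sum)

    # Cochran's Q: weighted squared deviations of each study from the pooled effect.
    cochran_q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1

    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    if cochran_q > 0:
        i_squared = max(0.0, (cochran_q - df) / cochran_q) * 100.0
    else:
        i_squared = 0.0
    return pooled_beta, pooled_se, cochran_q, i_squared


# Two invented studies that disagree strongly about the effect size.
pooled_beta, pooled_se, cochran_q, i_squared = fixed_effect_meta(
    betas=[0.1, 0.5], standard_errors=[0.1, 0.1]
)
```

For the two invented studies the pooled effect is 0.3, Q is 8 on 1 degree of freedom, and I² is 87.5%, a level of heterogeneity at which pooling under a fixed-effect model would be hard to justify.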

    3)Data Imputation

The imputation procedure makes use of the known LD and haplotype patterns in reference panels to estimate genotypes for SNPs that were not directly genotyped within a study. Commonly used algorithms include BimBam, IMPUTE, MaCH, and Beagle (all based on haplotype phasing algorithms, which estimate the contiguous set of alleles that lie on a specific chromosome).

IV. Outlook

As the content of genotyping chips, cohort sizes, and biobanks grow even larger, the challenges of data manipulation, quality control, strong study design, and strict phenotypic definitions grow more complex. Hence, moving forward, human geneticists will have to develop bioinformatics infrastructure and expertise to overcome such challenges. Most importantly, scientists will have to combine their bioinformatics efforts with genetics, biochemistry and cell biology to confirm the functional consequence and biological relevance of the genotype–phenotype associations that are identified.


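This chapter gives a concise overview of the basic concepts and principles of GWAS analysis in clinical medicine and of how to choose and use association models. In particular, it points out the shortcomings of current GWAS and how to avoid errors in practice. I suggest reading this introduction first when learning GWAS, and then looking up unfamiliar terms and commonly used software according to your own level. Another Jianshu article worth reading: Basic GWAS analysis.

More than ten years after GWAS was first proposed, it has played an important role, but many problems remain (see the extended reading) and there is still much room for improvement. As the authors say at the end, in Future Directions: ‘Ultimately, the translation of GWAS findings into clinical practice will rely upon correct assumptions regarding the genetic architecture of complex traits especially in the context of gene–gene and gene–environment interactions.’

References:

See the original chapter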

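Recommended: GWAS – Science topic

Extended reading:

The dilemma of GWAS and new thinking on genetic models

GWAS in the vortex: how much is genome-wide disease research worth?

How much further can GWAS go? Reflections on ten years

RVAS (low-frequency variant association analysis) becomes the new research favorite, overtaking GWAS

What is the relationship between sample size and the validity of results in GWAS studies?

What is genotype imputation in GWAS?

GWAS analysis of CNVs with Plink (Part 1)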
