Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
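An experiment yields a gene list L ranked by some metric, for example differential expression between two conditions.
Most studies then focus on the genes at the top and bottom of this list. That approach has several limitations:
1) After correction for multiple hypothesis testing, no individual gene may meet the threshold for statistical significance, because the relevant biological differences are modest relative to the experimental noise;
2) Sometimes the significant genes appear to share no common function or connection, so the result is hard to interpret biologically;
3) Single-gene analysis can easily miss important effects on signaling pathways;
4) When different groups study the same biological system, their lists of significantly differential genes may overlap very little.
GSEA tests whether the members of a gene set S tend to cluster at the top or bottom of the ranked list L.
Given an a priori defined gene set S (for example, one drawn from the Molecular Signatures Database, MSigDB), GSEA asks whether the genes in S are distributed randomly throughout L or are concentrated at its top or bottom.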
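Key steps of the GSEA algorithm:
Step 1: Calculate the enrichment score (ES)
Walk down the list L, increasing a running sum when a gene is in S and decreasing it when it is not; the size of each increment depends on the gene's correlation with the phenotype. The ES is the maximum deviation from zero reached by this running sum.
Step 2: Estimate the significance level of the ES
A nominal P value is estimated with a permutation test based on the sample (phenotype) labels.
Step 3: Adjust for multiple hypothesis testing
The ES of each gene set is normalized for the size of the set, giving the normalized enrichment score (NES).
When every step uses equal weight, the ES can peak for sets clustered in the middle of the ranked list; gene sets found this way do not show a biologically relevant correlation with the phenotype. The authors' updated method weights each step by the gene's correlation with the phenotype.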
The Leading-Edge Subset:
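The leading-edge subset consists of the genes in S that appear in the ranked list L at or before the point where the running sum reaches its maximum deviation from zero; these are the genes whose increments drive the ES to its peak. Genes in the leading-edge subset may act in the same biological process and are therefore more likely to be biologically meaningful.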
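As a rough illustration of the definition above, the following sketch extracts a leading-edge subset from a toy ranked list. The function names `running_sum` and `leading_edge`, the gene names, and the correlation values are all made up for this example. For a negative ES the sketch takes the members of S that come after the trough of the running sum, which is an assumption about how the definition mirrors for negatively enriched sets.

```python
def running_sum(ranked, gene_set, p=1.0):
    """Running-sum statistic over `ranked` = [(gene, r)], sorted by decreasing r.

    Assumes at least one gene of the set has non-zero r and at least one
    gene in the list is outside the set.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    n_r = sum(abs(r) ** p for (_, r), hit in zip(ranked, hits) if hit)
    total, running = 0.0, []
    for (_, r), hit in zip(ranked, hits):
        total += abs(r) ** p / n_r if hit else -1.0 / n_miss
        running.append(total)
    return hits, running


def leading_edge(ranked, gene_set, p=1.0):
    """Members of the set that contribute to the peak of the running sum."""
    hits, running = running_sum(ranked, gene_set, p)
    # Position of the maximum deviation from zero.
    peak = max(range(len(running)), key=lambda i: abs(running[i]))
    if running[peak] >= 0:
        # Positive ES: set members at or before the peak.
        return [g for (g, _), hit in zip(ranked[:peak + 1], hits) if hit]
    # Negative ES (assumed mirror case): set members after the trough.
    return [g for (g, _), hit in zip(ranked[peak + 1:], hits[peak + 1:]) if hit]


# Hypothetical ranked list: (gene, correlation with phenotype).
ranked = [("g1", 2.0), ("g2", 1.5), ("g3", 1.0),
          ("g4", -0.5), ("g5", -1.0), ("g6", -2.0)]

top_edge = leading_edge(ranked, {"g1", "g2", "g5"})
bottom_edge = leading_edge(ranked, {"g5", "g6"})
print(top_edge, bottom_edge)
```

In the first call the running sum peaks after g2, so g5 is in S but not in the leading edge.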
MSigDB Gene Sets:
The Molecular Signatures Database (MSigDB), now in its sixth major version (v6.1), contains 8 major collections, each subdivided into more specific gene sets.
GSEA Methods:
Inputs to GSEA:
- Expression data set D with N genes and k samples
- Ranking procedure to produce Gene List L. Includes a correlation (or other ranking metric) and a phenotype or profile of interest C.
- An exponent p to control the weight of the step.
- Independently derived Gene Set S of $N_H$ genes (e.g., a pathway, a cytogenetic band, or a GO category).
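To make the ranking input concrete, here is a minimal sketch that builds a ranked list L from a toy expression data set D by Pearson correlation with a binary phenotype C. Pearson correlation is only one possible ranking metric, and the helper names `pearson` and `rank_genes` as well as all data values are invented for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is 0).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_genes(expression, phenotype):
    """Return [(gene, r)] sorted by decreasing correlation with the phenotype."""
    scored = [(gene, pearson(values, phenotype))
              for gene, values in expression.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data set D: N = 4 genes, k = 4 samples (hypothetical values).
expression = {
    "geneA": [5.0, 6.0, 1.0, 2.0],
    "geneB": [1.0, 2.0, 5.0, 6.0],
    "geneC": [3.0, 1.0, 2.0, 4.0],
    "geneD": [4.0, 5.0, 3.0, 2.0],
}
# Phenotype C: 1 for the first class, 0 for the second.
phenotype = [1, 1, 0, 0]

ranked = rank_genes(expression, phenotype)
for gene, r in ranked:
    print(gene, round(r, 3))
```

Genes most positively correlated with the phenotype end up at the top of L and the most negatively correlated at the bottom, which is the ordering the enrichment score walks through.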
Enrichment Score ES(S):
- Rank order the N genes in D to form $L = \{g_1, \ldots, g_N\}$ according to the correlation, $r(g_j) = r_j$, of their expression profiles with C.
- Evaluate the fraction of genes in S ("hits") weighted by their correlation and the fraction of genes not in S ("misses") present up to a given position i in L.
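The ES is the maximum deviation from zero of $P_{hit} - P_{miss}$:

$$P_{hit}(S, i) = \sum_{\substack{g_j \in S \\ j \le i}} \frac{|r_j|^p}{N_R}, \qquad N_R = \sum_{g_j \in S} |r_j|^p$$

$$P_{miss}(S, i) = \sum_{\substack{g_j \notin S \\ j \le i}} \frac{1}{N - N_H}$$

Here S is the predefined gene set, L is the ranked gene list being analyzed, the exponent p controls the weight of each step (with p = 0 the ES reduces to the standard Kolmogorov-Smirnov statistic), and $r(g_j) = r_j$ is the correlation of gene $g_j$ with the phenotype.
Each gene $g_j$ at or before position i in L that belongs to S increases $P_{hit}(S, i)$ by $|r_j|^p / N_R$; each gene at or before position i that does not belong to S increases $P_{miss}(S, i)$ by $1 / (N - N_H)$.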
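The running-sum calculation can be sketched in a few lines of Python. This is a minimal illustration of the weighted hit and miss fractions under the definitions above, not the reference GSEA implementation; the function name `enrichment_score` and the toy ranked list are assumptions made for the example.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Return (ES, running) for `ranked` = [(gene, r)] sorted by decreasing r.

    running[i] is P_hit - P_miss evaluated at position i of the list.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)          # N - N_H
    n_r = sum(abs(r) ** p                      # N_R
              for (_, r), hit in zip(ranked, hits) if hit)
    if n_r == 0 or n_miss == 0:
        raise ValueError("need a hit with non-zero weight and at least one miss")

    total, running = 0.0, []
    for (_, r), hit in zip(ranked, hits):
        # A hit adds |r|^p / N_R; a miss subtracts 1 / (N - N_H).
        total += abs(r) ** p / n_r if hit else -1.0 / n_miss
        running.append(total)

    # ES is the running-sum value with the largest deviation from zero.
    es = max(running, key=abs)
    return es, running


# Hypothetical ranked list: (gene, correlation with phenotype).
ranked = [("g1", 2.0), ("g2", 1.5), ("g3", 1.0),
          ("g4", -0.5), ("g5", -1.0), ("g6", -2.0)]
gene_set = {"g1", "g2", "g5"}

es_weighted, running = enrichment_score(ranked, gene_set, p=1.0)
es_unweighted, _ = enrichment_score(ranked, gene_set, p=0.0)
print(es_weighted, es_unweighted)
```

With p = 1 the two top-ranked hits carry most of the weight, so the peak is higher than with the equal-weight setting p = 0. In both cases the running sum returns to zero at the end of the list because the hit and miss fractions each sum to 1.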
Estimating Significance:
- Randomly assign the original phenotype labels to samples, reorder genes, and re-compute ES(S).
- Repeat step 1 for 1,000 permutations, and create a histogram of the corresponding enrichment scores $ES_{NULL}$.
- Estimate nominal P value for S from $ES_{NULL}$ by using the positive or negative portion of the distribution corresponding to the sign of the observed ES(S).
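The three steps above can be sketched as a small phenotype-permutation test. Everything here is a simplified assumption for illustration: the toy expression values, the use of Pearson correlation as the ranking metric, the helper names, and the choice of 200 permutations (instead of 1,000) to keep the example fast.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x)
                  * sum((b - my) ** 2 for b in y)) ** 0.5


def enrichment_score(expression, labels, gene_set, p=1.0):
    """Rank genes by correlation with `labels`, then compute ES(S)."""
    ranked = sorted(((g, pearson(v, labels)) for g, v in expression.items()),
                    key=lambda item: item[1], reverse=True)
    hits = [g in gene_set for g, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    n_r = sum(abs(r) ** p for (_, r), hit in zip(ranked, hits) if hit)
    if n_r == 0:  # degenerate case; returning 0 is a convention assumed here
        return 0.0
    total, running = 0.0, []
    for (_, r), hit in zip(ranked, hits):
        total += abs(r) ** p / n_r if hit else -1.0 / n_miss
        running.append(total)
    return max(running, key=abs)


def nominal_p_value(expression, labels, gene_set, n_perm=1000, seed=0):
    """Permute phenotype labels, rebuild the ranking, and recompute ES."""
    rng = random.Random(seed)
    observed = enrichment_score(expression, labels, gene_set)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(enrichment_score(expression, shuffled, gene_set))
    # Use only the part of the null distribution with the observed sign.
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed))
    p_value = extreme / len(same_sign) if same_sign else float("nan")
    return observed, p_value, null


# Toy data: 4 set genes high in class 1, plus 6 background genes.
labels = [1, 1, 1, 0, 0, 0]
expression = {f"s{i}": [5 + 0.1 * i, 6, 5.5, 1, 2 - 0.1 * i, 1.5] for i in range(4)}
expression.update({
    "b1": [1, 2, 3, 3, 2, 1], "b2": [2, 1, 2, 1, 2, 1.5],
    "b3": [1, 2, 1, 5, 6, 5.5], "b4": [3, 1, 4, 1, 5, 2],
    "b5": [2, 3, 1, 4, 2, 5], "b6": [4, 2, 3, 3, 1, 2],
})
gene_set = {"s0", "s1", "s2", "s3"}

observed, p_value, null = nominal_p_value(expression, labels, gene_set, n_perm=200)
print(observed, p_value)
```

Permuting the labels rather than the genes keeps the gene-gene correlation structure of the data intact. With only six samples there are just 20 distinct label assignments, so the P value from this toy example is very coarse.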
Multiple Hypothesis Testing:
- Determine ES(S) for each gene set in the collection or database.
- For each S and 1000 fixed permutations π of the phenotype labels, reorder the genes in L and determine ES(S, π).
- Adjust for variation in gene set size. Normalize the ES(S, π) and the observed ES(S), separately rescaling the positive and negative scores by dividing by the mean of the ES(S, π) to yield the normalized scores NES(S, π) and NES(S)
- Compute FDR. Control the ratio of false positives to the total number of gene sets attaining a fixed level of significance separately for positive (negative) NES(S) and NES(S, π).
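The normalization and FDR steps can be illustrated with the sketch below. It assumes the observed ES and the permutation ES values have already been computed for each gene set; the three gene sets and all numbers are invented. The q value follows one reading of the description above: the fraction of same-sign null NES values at least as extreme as the observed NES, divided by the fraction of same-sign observed NES values at least as extreme. Capping q at 1 is an added convention.

```python
from statistics import mean


def normalize(es, null_es):
    """Rescale ES and its null scores by the mean magnitude of same-sign nulls."""
    pos = [x for x in null_es if x >= 0]
    neg = [x for x in null_es if x < 0]
    pos_mean = mean(pos) if pos else None
    neg_mean = abs(mean(neg)) if neg else None

    def scale(x):
        m = pos_mean if x >= 0 else neg_mean
        return x / m if m else float("nan")

    return scale(es), [scale(x) for x in null_es]


def fdr_q_values(observed_es, null_es):
    """Return (NES per set, FDR q value per set)."""
    nes, null_nes = {}, []
    for name, es in observed_es.items():
        nes[name], scaled = normalize(es, null_es[name])
        null_nes.extend(scaled)  # pool NES(S, pi) over all sets and permutations

    q = {}
    for name, value in nes.items():
        if value >= 0:
            null_side = [x for x in null_nes if x >= 0]
            obs_side = [x for x in nes.values() if x >= 0]
            n_null = sum(x >= value for x in null_side)
            n_obs = sum(x >= value for x in obs_side)
        else:
            null_side = [x for x in null_nes if x < 0]
            obs_side = [x for x in nes.values() if x < 0]
            n_null = sum(x <= value for x in null_side)
            n_obs = sum(x <= value for x in obs_side)
        null_frac = n_null / len(null_side) if null_side else 0.0
        obs_frac = n_obs / len(obs_side)  # > 0: the set itself is counted
        q[name] = min(1.0, null_frac / obs_frac)
    return nes, q


# Hypothetical observed ES and five permutation ES values per gene set.
observed_es = {"setA": 0.8, "setB": 0.3, "setC": -0.6}
null_es = {
    "setA": [0.4, 0.2, -0.3, 0.6, -0.5],
    "setB": [0.3, 0.5, -0.2, 0.4, -0.4],
    "setC": [0.2, -0.6, 0.4, -0.2, -0.4],
}

nes, q = fdr_q_values(observed_es, null_es)
for name in observed_es:
    print(name, round(nes[name], 3), round(q[name], 3))
```

Dividing by the mean of the same-sign null scores puts gene sets of different sizes on a comparable scale, which is what allows the null NES values to be pooled across sets before computing the FDR.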