Expanding the computational toolbox for mining cancer genomes

Corresponding author: Li Ding
Director of Computational Biology, Oncology
Washington University School of Medicine, St. Louis, MO

Sample procurement, sequencing and analysis roadmap.

1. Sequencing strategies

Shift from WES to WGS: WGS data are considered to be the unbiased 'gold standard'.

  • 1.1 Traditional sequencing analyses
    In practice, detecting all germline and somatic aberrations is a formidable challenge owing to limitations in current analysis algorithms, as well as to the quantity and quality of sequence data.

  • 1.2 Subclonal analyses
    Cancer progression has long been known to be a fundamentally clonal process, and sequence coverage is now becoming sufficiently large to permit detection of the low-prevalence events that are routinely associated with tumour subclones. Multisite and/or multistage sequencing and tumour sectioning experiments have begun to identify founding clones and subclones that contribute to cancer progression.

  • 1.3 Single-cell sequencing
    Pioneering work on assessing CNAs in multiple tumour subpopulations was followed by single-cell sequencing using whole-genome amplification (WGA) of DNA extracted from nuclei that were sorted by flow cytometry.
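    Several challenges remain, such as amplification bias in degenerate oligonucleotide-primed WGA and in multiple displacement amplification techniques. (In degenerate oligonucleotide-primed WGA, the 3' end of the primer carries a 6-bp random sequence that can bind genomic DNA at random, which allows the whole genome to be amplified. Multiple displacement amplification uses random primers and isothermal amplification to produce large, high-fidelity DNA fragments; its main drawbacks are uneven genome coverage, amplification bias, chimeric sequences and nonspecific amplification.) These biases lead to non-uniform coverage and therefore make it difficult to determine somatic changes, including SNVs, CNAs and structural aberrations. Detection sensitivity is most affected by allelic dropout, which arises from preferential amplification of one of the two alleles; reported allelic dropout rates range from 8% to 40%. Large CNAs can still be detected at low genome coverage (for example, 5-6%), whereas unequal coverage makes the analysis of smaller CNAs and structural variants extremely difficult.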

2. Dissecting genomic changes in cancer

The table below lists computational tools for annotating and interpreting mutations in cancer genomes.

SNV and indel detection

| Program | Function | Synopsis | Refs |
| Bassovac | SNV and indel detection | Bayesian approach with tumour or normal impurity and clonality | |
| GATK | SNV and indel detection | Analysis framework using MapReduce | 23 |
| JointSNVMix | SNV detection | Binomial/multinomial probability with pre-filtering | 31 |
| MuTect | SNV and indel detection | Bayesian probability with pre- and post-filtering | 28 |
| Pindel | Indel detection | Pattern growth learning method | 38 |
| SNVMix | SNV detection | Binomial mixture model | 30 |
| SomaticSniper | SNV and indel detection | Bayesian probability with posterior filtering | 27 |
| Strelka | SNV and indel detection | Bayesian probability with posterior filtering | 29 |
| VarScan | SNV and indel detection | Fisher exact test, filtering and FDR correction | 24,25 |

Copy-number aberration, structural variant and gene fusion detection

| Program | Function | Synopsis | Refs |
| BreakDancer | Structural variant and indel detection | Kolmogorov–Smirnov test on discordant reads | 54 |
| BreakFusion | Gene fusion detection | Alignment-based pipeline for transcriptomic data | 68 |
| BreakTrans | Gene fusion mapping | Integration of fusion discovery and breakpoint tools | 73 |
| ChimeraScan | Chimeric transcription detection | Discordant read pairs with posterior filtering | 67 |
| CREST | Structural variant detection | Heuristics and binomial test on soft-clipped reads | 55 |
| deFuse | Gene fusion detection | Dynamic programming split and discordant reads | 65 |
| DELLY | Structural variant detection | Integrated method of discordant and split reads | 40 |
| GASV-Pro | Structural variant detection | Plane sweep for segment intersection | 57 |
| Genome STRiP | Structural variant detection | Depth and split or discordant reads on populations | 59 |
| Hydra | Structural variant detection | Discordant reads with assembly validation | 139 |
| LUMPY | Structural variant detection | Integrated method of discordant and split reads | 167 |
| TIGRA | Structural variant detection | De Bruijn graph-based assembly | 42 |

Level I annotation and interpretation

| Program | Function | Synopsis | Refs |
| ABSOLUTE | Purity, ploidy and clonality prediction | Optimization of logarithmic scores | 148 |
| ANNOVAR | Functional prediction | Annotation-based prediction | 74 |
| ASCAT | Purity, ploidy and clonality prediction | Goodness-of-fit ranking of candidate solutions | 168 |
| TUSON Explorer | Gene classification | Oncogene or tumour suppressor discovery using mutational signatures | 100 |
| CHASM | Functional prediction | Random forest classifier | 84,85 |
| MutationAssessor | Functional prediction | Conservation-based prediction (entropy score) | 83 |
| PolyPhen2 | Functional prediction | Probability model based on structure and alignment | 81,169 |
| SciClone | Tumour clonality prediction | Bayesian mixture model | |
| SIFT | Functional prediction | Conservation-based prediction | 82 |
| SNPeff | Functional prediction | Annotation and coding effect prediction | 75 |
| THetA | Purity, ploidy and clonality prediction | Maximum likelihood of mixture composition | 151 |
| VEP | Functional prediction | Annotation-based prediction | 170 |

Level II annotation and interpretation

| Program | Function | Synopsis | Refs |
| Dendrix | Mutation analysis | De novo discovery of mutually exclusive mutations | 128 |
| HotNet | Network analysis | Diffusion model for significant networks | 119 |
| MEMo | Network analysis | Network modules with mutual exclusivity | 122 |
| MuSiC | Mutation analysis | Framework for significance analysis of mutations | 92 |
| Multi-Dendrix | Mutation analysis | De novo discovery of multiple sets of exclusive mutations | 129 |
| MutSigCV | Mutation analysis | Gene significance with variable background mutation rate | 93 |
| NBS | Network analysis | Clustering using non-negative matrix factorization | 121 |
| Oncodrive-CIS and OncodriveCLUST | Mutation analysis | Z-statistics for copy numbers of driver genes | 171,172 |
| PARADIGM | Gene expression analysis | Network analysis of gene expression | 126 |
| PathScan | Pathway analysis | Probability model for mutation-enriched pathways | 109 |
| TieDIE | Network analysis | Network diffusion model linking mutations to gene expression | 125 |

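Empirically, candidate events called by multiple independent algorithms are less likely to be false positives than events called by any single algorithm. Multicaller strategies are therefore becoming more common, although they also affect the sensitivity of the results. The number of possible tool combinations is very large, however, which makes such strategies hard to implement.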
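
As a rough illustration of the multicaller idea, the sketch below keeps only the variants reported by at least a chosen number of callers. The variant key (chromosome, position, ref, alt), the caller labels and the toy call sets are assumptions made for this example; real pipelines would first normalize variant representations across tools.

```python
from collections import Counter

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
Variant = tuple[str, int, str, str]


def consensus_calls(calls_by_caller: dict[str, set[Variant]],
                    min_callers: int = 2) -> set[Variant]:
    """Return the variants reported by at least `min_callers` callers."""
    support = Counter()
    for calls in calls_by_caller.values():
        # Each caller's calls are a set, so a caller votes once per variant.
        support.update(calls)
    return {variant for variant, n in support.items() if n >= min_callers}


# Hypothetical toy call sets; the caller names are only labels.
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
}

union = set().union(*calls.values())                 # most sensitive choice
consensus = consensus_calls(calls, min_callers=2)    # stricter, fewer calls
```

Raising `min_callers` trades sensitivity for specificity: the union keeps every candidate, while requiring all callers to agree keeps only the events every tool supports.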

  • 2.1 SNV detection
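    SNV detection algorithms: GATK, VarScan, SAMtools, SomaticSniper, MuTect, Strelka, JointSNVMix and SNVMix. The first three can handle both germline and somatic variants; the others call somatic mutations using tumour and matched normal genomic sequences.
    Although the variant allele fraction (VAF) of a heterozygous variant is expected to be 50% in germline samples, this figure does not hold for somatic mutations in tumours, mainly because of normal tissue contamination and/or tumour heterogeneity. Algorithm development currently focuses on handling somatic mutations across a wide range of VAFs. One example is Bassovac, which accounts for impurity in both the tumour and the normal sample and for tumour subclonal structure (that is, heterogeneity) when calling variants.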
  • 2.2 Indel detection
    Indel detection is still challenging, mainly owing both to their lower frequencies than those of SNVs and to mapping difficulties.
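    By default, most tools allow two mismatches and no gaps in 'seeded' regions (that is, in the first 28 bp of a read), so reads that contain indels fail to align correctly. Paired-end mapping is helpful for finding large indels flanked by the paired ends. Gapped alignment, split-read and de novo assembly approaches are currently the common ways of detecting indels. VarScan (ref. 25) and the GATK Unified Genotyper are based on heuristics for indel calling using raw statistics such as coverage, number of indel-supporting reads, read mapping qualities and mismatch counts.
    Many existing tools work well for short indels (< 5-8 bp) but lack a high true-positive rate. In addition, they usually fail to detect medium-sized indels, including some known 'druggable' and/or prognostic events. Finally, detection in low-complexity regions (such as homopolymers) is especially challenging. SAMtools and Dindel can call short indels. Pindel uses a pattern growth approach borrowed from protein data analysis to detect indel breakpoints and has relatively high accuracy, whereas DELLY integrates discordant and split reads. Burrows-Wheeler Aligner (BWA)-MEM (ref. 41) allows better discovery of long indels and SVs, and local de novo assembly or multiple alignment can reduce the number of false-positive indels.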
  • 2.3 CNA and structural variant detection
    Accurate inference of copy number from sequence data requires normalization procedures that consider certain biases inherent to short-read sequencing methods (such as GC content and library biases). Approaches have been implemented for both GC-based coverage normalization and mapping bias.
    Finding recurrent CNAs: Genomic Identification of Significant Targets in Cancer (GISTIC) and correlation matrix diagonal segmentation (CMDS) have been developed to identify recurrent CNAs.
    Detecting multiple types of structural change (deletions, tandem or inverted duplications, inversions, insertions and translocations): BreakDancer, CREST (clipping reveals structure), VariationHunter, geometric analysis of structural variants (GASV)-Pro and Genome STRucture In Populations (Genome STRiP).
  • 2.4 Gene fusion detection
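    Discovering gene fusions from RNA-seq: TopHat-fusion, deFuse, MapSplice, ChimeraScan and BreakFusion.
    Gene fusions can arise from simple translocations that involve only two distant loci, or from complex rearrangements that involve multiple distant loci. Comrad and nFuse both align raw WGS and RNA-seq reads and validate fusions and genomic breakpoints at the same time.
    Comrad and nFuse can account for ambiguous read alignments and can therefore minimize errors caused by misalignment.
    The review's authors recently developed BreakTrans, which jointly analyses WGS and RNA-seq data to test hypotheses produced by other tools (such as TopHat-fusion, MapSplice, BreakDancer and CREST) and to further characterize the mechanistic components of gene fusions.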
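
To make the GC-based coverage normalization mentioned under 2.3 more concrete, here is a minimal sketch of one common scheme: group fixed-size genomic windows by GC content and rescale each window's read depth by the median depth of its GC bin. The function name, the bin width and the toy numbers are assumptions for illustration, not the procedure of any specific tool named above.

```python
from collections import defaultdict
from statistics import median


def gc_normalize(depths, gc_fractions, bin_width=0.05):
    """Rescale window depths so every GC bin has the same median depth.

    depths       : read depth per genomic window
    gc_fractions : GC fraction (0..1) of the same windows
    bin_width    : width of the GC bins (an assumed default)
    """
    # Group depths by GC bin.
    bins = defaultdict(list)
    for depth, gc in zip(depths, gc_fractions):
        bins[int(gc / bin_width)].append(depth)

    bin_median = {b: median(values) for b, values in bins.items()}
    overall = median(depths)

    normalized = []
    for depth, gc in zip(depths, gc_fractions):
        m = bin_median[int(gc / bin_width)]
        # Windows in a bin with zero median depth carry no usable signal.
        normalized.append(depth * overall / m if m > 0 else 0.0)
    return normalized


# Toy data: GC-rich windows are sequenced about twice as deeply.
depths = [10, 12, 20, 24]
gc = [0.31, 0.32, 0.61, 0.62]
corrected = gc_normalize(depths, gc)
```

After correction the GC-poor and GC-rich windows sit on the same scale, so the remaining depth differences are more likely to reflect copy number than sequencing bias.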

3. Driver mutations and pathways

  • 3.1 Annotations and functional predictions
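    Gene and transcript annotation: RefSeq, Ensembl and GENCODE
    Regulatory elements: ENCODE, TransFac and RegulomeDB
    Non-coding RNA: NONCODE, BodyMap and miRBase
    Protein annotation: Pfam and InterPro
    Integrated annotation: ANNOVAR and SNPeff annotate transcript variants; SKIPPY predicts cryptic splicing effects; VEP, FunSeq and SNPnexus provide extended support, including annotation of non-coding elements and regulatory features; VAAST (Variant Annotation, Analysis and Search Tool) and GEMINI (genome mining) allow comprehensive analysis and integration of coding variants, non-coding variants, regulatory elements and phenotypes
    Deleteriousness: PolyPhen, SIFT, MutationAssessor and Condel
    Protein post-translational modification: ActiveDriver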
  • 3.2 Significantly mutated genes
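    One way to detect driver mutations is to distinguish them from the background mutation rate (BMR). Measuring the BMR is difficult because many factors affect it, including differences in gene length, expression level and replication timing, variation among samples and errors in upstream analyses. The BMR varies not only among patients with the same cancer type but also among cancer types, where it may be related to environmental factors and viral characteristics. Finally, incorrect or biased annotation of mutations can lead to false positives. Insufficient sequence coverage of genes exacerbates these problems. MuSiC and MutSig can address these issues.
    Another approach for distinguishing driver mutations from passenger mutations is to check whether mutations cluster at particular residues of the protein sequence. The '20/20 rule' proposes that a gene should be classified as an oncogene if at least 20% of its missense mutations (or identical in-frame indels) fall at a particular residue. Conversely, a gene can be classified as a tumour suppressor if at least 20% of its mutations are inactivating (that is, nonsense, frameshift, splice-site or stop codon read-through mutations). This approach is now complemented by algorithms that use more rigorous statistical scores to assess patterns of mutational signals, as well as the clustering of mutations in the protein sequence or in the three-dimensional protein structure.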
  • 3.3 Pathway and network analyses
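    Pathway and network analyses: (1) analyse known pathways, which are represented as gene sets; (2) analyse interaction networks to implicitly build pathways de novo.
    Method 1: A direct way to assess combinations of mutated genes is to examine the overlap between a list of mutated genes and predefined gene sets of known biological function, such as those in KEGG, GO and MSigDB. For example, suppose we have a list of mutated genes (M) and want to know whether it contains genes that regulate the cell cycle. Using the KEGG database, we find a list of more than 20 cell-cycle genes (L). Two statistical tests can be used to check whether M and L overlap significantly. First, if M is ranked (for example, by one of the mutation significance scores described above), gene set enrichment analysis (GSEA) can determine whether the genes in L lie near the top of the ranked list M. Second, if M is unranked, a hypergeometric test can assess the overlap between M and L.
    Method 2: The analysis above has several drawbacks. (1) Human gene annotations and pathway databases remain incomplete, and there is extensive crosstalk between pathways, which implies that decisions regarding the genes that form the boundary of a pathway are arbitrary to some extent. (2) The crosstalk is represented in gene-set and pathway databases by the presence of multiple overlapping gene sets, thus complicating the interpretation of reported enrichments. (3) Finally, signalling and regulatory pathways have a rich topology of activating and inhibitory interactions, and this information is not represented in the list of genes or proteins that are members of the pathway, so enrichment analysis cannot capture activation and inhibition. To overcome these limitations, the second approach to analysing combinations of mutations uses biological interaction networks, which replace gene sets in determining which combinations of mutations should be evaluated further. However, most biological networks have a heterogeneous topology characterized by the presence of hubs (highly connected nodes). HotNet is a method for finding subnetworks of a large interaction network that are mutated more often than expected by chance. It has been used to identify subnetworks in several cancer types analysed in the context of TCGA, for example, mutations involving the Notch signalling pathway in ovarian cancer. Other tools include network-based stratification (NBS), MEMo and Tied Diffusion Through Interacting Events (TieDIE).
    Method 3: The third approach to analysing combinations of mutations is to identify mutually exclusive sets of mutations, which can reveal combinations of driver mutations. MEMo applies this concept to genes that are known to interact; alternatively, one can try to discover mutually exclusive gene sets de novo, without restricting the gene sets in advance (Dendrix, Multi-Dendrix, RME).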
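
The hypergeometric test described under Method 1 can be sketched with the standard library alone. The gene names, the two gene lists and the universe size of 20,000 genes below are illustrative assumptions, not the actual KEGG cell-cycle set.

```python
from math import comb


def hypergeom_overlap_pvalue(universe_size, set_size, list_size, overlap):
    """One-sided P(X >= overlap) for the overlap of an unranked list with a gene set.

    universe_size : number of genes that could have been mutated (N)
    set_size      : number of genes in the pathway gene set L
    list_size     : number of genes in the mutated list M
    overlap       : number of genes shared by M and L
    """
    total = comb(universe_size, list_size)
    max_overlap = min(set_size, list_size)
    # Sum the hypergeometric probabilities over the upper tail.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical inputs for illustration only.
mutated = {"TP53", "RB1", "CDKN2A", "KRAS", "EGFR"}                # list M
cell_cycle = {"TP53", "RB1", "CDKN2A", "CCND1", "CDK4", "E2F1"}    # gene set L
universe_size = 20000                                              # assumed gene count

p_value = hypergeom_overlap_pvalue(
    universe_size, len(cell_cycle), len(mutated), len(mutated & cell_cycle)
)
```

A small `p_value` suggests that M contains more genes from L than expected by chance. When many gene sets are tested, the resulting p-values would still need multiple-testing correction.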

4. Genome integrity and clonal architectures

  • 4.1 Kataegis, chromothripsis and chromoplexy
    One of the most striking findings in TCGA is genomes with extreme numbers and types of mutations.
    Kataegis is the occurrence of an unusually large number of SNVs clustered in a single locus; it was first reported in breast tumours and subsequently in other cancer types.
    Chromothripsis, in which one or more loci undergo a catastrophic event of simultaneous breakage and aberrant repair at multiple breakpoints in a single cell division, was originally reported in ~2–3% of all cancers but was shown to be particularly common in bone cancers (~25%); it was later found that it may be associated with TP53 mutations. Chromoplexy is a similar event discovered in prostate cancer.
  • 4.2 Defining clonal architecture in heterogeneous tumours
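    All of the genomic alterations discussed above play a part in clonal evolution.
    ABSOLUTE adds a best-fit CNA model and a karyotype likelihood model.
    PyClone uses hierarchical Bayesian clustering to identify clones.
    SciClone uses a Bayesian mixture model to examine multiple samples from a patient, taken either over time (primary and relapse tumour samples) or in space (multiple biopsy samples).
    The Tumor Heterogeneity Analysis (THetA) algorithm accounts for the presence of CNAs, which confound the analysis of VAFs.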
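
To see why purity, subclonality and CNAs all confound VAF-based analysis, the sketch below computes the expected VAF of a somatic mutation from the usual allele-counting argument: mutant alleles come only from tumour cells that carry the mutation, while total alleles come from both tumour and normal cells. The function and parameter names are assumptions for this example and do not reproduce the model of any tool listed above.

```python
def expected_vaf(purity, tumour_cn=2, mutant_copies=1,
                 cell_fraction=1.0, normal_cn=2):
    """Expected variant allele fraction of a somatic mutation in a bulk sample.

    purity        : fraction of cells in the sample that are tumour cells
    tumour_cn     : total copy number of the locus in tumour cells
    mutant_copies : copies carrying the mutation in each mutated tumour cell
    cell_fraction : fraction of tumour cells carrying the mutation (1.0 = clonal)
    normal_cn     : copy number of the locus in normal cells
    """
    if not 0.0 <= purity <= 1.0 or not 0.0 <= cell_fraction <= 1.0:
        raise ValueError("purity and cell_fraction must lie in [0, 1]")
    if mutant_copies > tumour_cn:
        raise ValueError("mutant_copies cannot exceed tumour_cn")
    mutant_alleles = purity * cell_fraction * mutant_copies
    total_alleles = purity * tumour_cn + (1.0 - purity) * normal_cn
    if total_alleles <= 0:
        raise ValueError("no alleles present at this locus")
    return mutant_alleles / total_alleles


pure_clonal = expected_vaf(1.0)                    # heterozygous, pure tumour
contaminated = expected_vaf(0.6)                   # 40% normal cells
subclonal = expected_vaf(0.6, cell_fraction=0.5)   # present in half of tumour cells
amplified = expected_vaf(0.6, tumour_cn=4)         # one mutant copy out of four
```

In this toy setting a subclonal mutation in a diploid region and a clonal mutation in an amplified region can yield similar VAFs, which is why copy number has to be modelled before VAFs are clustered into clones.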

5. Conclusion: basic and clinical applications

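In the short time since cancer genomics entered biomedicine, it has made many fundamental contributions:
First, cancer-related genes and pathways have been identified;
Second, germline susceptibility has been established;
Third, technologies and algorithms have continued to improve;
Fourth, large data sets have been organized and documented;
Finally, knowledge has been catalogued in new databases.
Future challenges:
The 'data spectrum' and associated analysis tools are not yet complete, for example for proteomic data;
The second factor is the reality of cost;
The next chapter of cancer research will undoubtedly push clinical applications further and draw large pharmaceutical companies more deeply into the development of new therapeutics.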
