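March, week 3 literature reading: Chromatin organization is a major influence on regional mutation rates in human cancer cells.
Bioinformatics approach: using public data on cancer mutations and epigenetic features, the paper finds that compact chromatin is associated with mutation frequency; a regression model combining features explains more than 55% of the variance in mutation density (R^2 > 0.55). The main analyses are correlation, heat maps and linear regression. The highlight is the use of the earliest whole-genome cancer mutation data published at the time.
Abstract: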
Cancer genome sequencing provides the first direct information on how mutation rates vary across the human genome in somatic cells.
Testing diverse genetic and epigenetic features, here we show that mutation rates in cancer genomes are strikingly related to chromatin organization.
(Note: measuring mutation rates and the cellular state of somatic cells; genetic and epigenetic features, observed through chromatin state.)
Indeed, at the megabase scale, a single feature, levels of the heterochromatin-associated histone modification H3K9me3, can account for more than 40% of mutation-rate variation, and a combination of features can account for more than 55%.
The strong association between mutation rates and chromatin organization is upheld in samples from different tissues and for different mutation types.
This suggests that the arrangement of the genome into heterochromatin- and euchromatin-like domains is a dominant influence on regional mutation-rate variation in human somatic cells.
Paragraph 1:
Comparative genomics and population studies suggest that human germline mutation rates are not constant across the genome.
Many genetic and epigenetic properties have been proposed to influence the rate of single nucleotide changes, including local base composition, DNA replication timing, chromatin structure and the formation of double-strand breaks.
The sequencing of cancer genomes provides a unique opportunity to assess directly how mutation rates vary across the human genome;
by subtracting base changes observed in normal tissue from the same individual, a set of somatic single nucleotide variants (SNVs) can be derived.
Moreover, the large number of genome-scale data sets available for human somatic cells enables the investigation of potential causes of mutation rate variation.
It has been noted that tumours from different tissues show biases in mutation type and evidence of transcription-coupled repair.
In addition, another study showed that, at the 1-megabase (Mb) scale, there is substantial variation in the density of somatic mutations along the human genome and, moreover, that this regional variation in mutation density was correlated in three different tumour genomes.
They also showed that somatic mutation rates measured in cancer genomes moderately correlate with those inferred in the germline from human–chimp sequence divergence (Supplementary Fig. 1).
However, so far, the individual features associated with mutation-rate variation explain very little of the regional variance across the genome.
(Note: factors proposed to influence mutation-rate variation: local base composition, DNA replication timing, chromatin structure and the formation of double-strand breaks. Current limitation: the individual features associated with mutation-rate variation explain very little of the regional variance across the genome.)
Paragraph 2:
We gathered a total of 84,879 unique SNV positions from individual leukaemia, melanoma, small cell lung cancer and prostate cancer genomes.
To identify potential causes of mutation-rate variation across the genome, we compiled a set of diverse genetic and epigenetic features that have been mapped genome-wide in human cells.
In total we examined 46 features, including base composition, CpG content, gene density, DNA replication timing, nucleosome occupancy, long-range chromatin interactions (Hi-C), recombination rate, the density of unique sequences (mappability of 24-base polymers), levels of 18 histone acetylations, levels of 17 histone methylations, and occupancy of RNA polymerase II, the CTCF insulator protein and the histone variant H2AZ.
We then calculated the correlation coefficient for all pairwise combinations of features, including cancer SNV density, human–chimp divergence and germline single nucleotide polymorphism (SNP) density, and clustered the features using these correlation coefficients as a distance metric.
(Note: potential causes of mutation-rate variation across the genome; preparation of the data sets analysed; the categories of epigenetic features.)
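The pairwise-correlation step can be illustrated with a small sketch. It assumes each feature has already been summarized as one value per genomic window, and it assumes the distance between two features is 1 - r; the notes only say that correlation coefficients were used as a distance metric, so that transformation, the function names and the toy values are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distances(features):
    """Map each unordered feature pair to the distance 1 - r (assumed transformation)."""
    names = sorted(features)
    return {
        (a, b): 1.0 - pearson(features[a], features[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Toy per-window values, invented for illustration (not data from the paper).
features = {
    "snv_density": [5.0, 9.0, 2.0, 8.0, 3.0, 1.0],
    "H3K9me3": [4.0, 8.0, 1.0, 9.0, 2.0, 2.0],
    "H3K4me3": [3.0, 1.0, 8.0, 2.0, 7.0, 9.0],
}
dist = correlation_distances(features)
closest = min(dist, key=dist.get)  # the pair a clustering step would merge first
print(closest, round(dist[closest], 3))
```

With this distance, positively correlated features end up close together and anti-correlated features end up far apart (distance above 1), which is the grouping a hierarchical clustering of the full 46-feature matrix would build on.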
Paragraph 3:
Surprisingly, we found that at the megabase scale, cancer SNV density is strikingly correlated with many features of somatic cell chromatin organization (Fig. 1).
The feature most strongly correlated with cancer SNV density is the repressive histone modification H3K9me3 (r = 0.64, P < 2.2 × 10^-16).
Positive correlations are also observed with other repressive marks including H3K9me2 (r = 0.53, P < 2.2 × 10^-16) and H4K20me3 (r = 0.39, P < 2.2 × 10^-16).
In contrast, cancer SNV density anti-correlates with levels of many histone modifications associated with open chromatin, such as H3K4me3 (r = -0.59, P < 2.2 × 10^-16) and H3K9ac (r = -0.59, P < 2.2 × 10^-16).
More moderate anti-correlation is observed with GC content (r = -0.47, P < 2.2 × 10^-16), gene density (r = -0.42, P < 2.2 × 10^-16), early replication timing (r = -0.30, P < 2.2 × 10^-16) and the density of highly positioned nucleosomes (r = -0.43, P < 2.2 × 10^-16).
These conclusions are upheld when using alternative genomic window sizes (Fig. 1a): for example, the correlation with H3K9me3 is 0.37 at 100-kilobase resolution and 0.69 at 10-Mb resolution.
We note that the weaker correlations when considering smaller window sizes may be due to the low median number of SNVs per window (Supplementary Fig. 10).
Taken together this shows that, at least at the megabase scale, regional mutation-rate variation is strongly associated with regional variation in chromatin organization.
(Note: correlation between mutation rate and chromatin organization at different scales; at least at the megabase scale, regional mutation-rate variation is strongly associated with regional variation in chromatin organization.)
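To make the window-size remark concrete, here is a minimal sketch of binning SNV positions into fixed-size windows. The positions, the chromosome length and the function name are invented for illustration; the point is only that shrinking the window leaves most windows empty, which is the low-median-count effect the authors mention.

```python
import statistics
from collections import Counter

def snv_counts_per_window(positions, chrom_length, window_size):
    """Count 0-based SNV positions in consecutive fixed-size windows."""
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = Counter(pos // window_size for pos in positions)
    return [counts.get(i, 0) for i in range(n_windows)]

# Hypothetical SNV coordinates on a 4-Mb toy chromosome.
positions = [120_000, 950_000, 1_200_000, 1_250_000, 1_900_000, 3_400_000]

per_mb = snv_counts_per_window(positions, 4_000_000, 1_000_000)
per_100kb = snv_counts_per_window(positions, 4_000_000, 100_000)

print(per_mb, statistics.median(per_mb))
print(len(per_100kb), statistics.median(per_100kb))
```

At 1-Mb resolution the toy counts are [2, 3, 0, 1]; at 100-kb resolution the same six SNVs are spread over 40 windows and the median count drops to 0, so any correlation with a chromatin feature is computed on much noisier values.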
Paragraph 4:
We used principal component analysis to investigate further the inter-dependencies among the different chromosome features.
(Note: the inter-dependencies among the epigenetic features.)
This revealed that at 1-Mb resolution, nearly 60% of the variance in these diverse features could be accounted for by a first principal component (Supplementary Fig. 3b).
That is, the regional variation in many different genetic and epigenetic features can be captured by variation in a single underlying component along the genome.
Features with a strong loading on this first principal component include many histone modifications and other features classically associated with either highly accessible euchromatin or inaccessible heterochromatin (Supplementary Fig. 3a).
For example, the histone modifications H3K9me3 and H4K20me3 have strong negative loadings on this component, and GC content, gene density, early DNA replication and many activation marks show strong positive loadings (Supplementary Fig. 3a).
Cancer SNV density also has a strong negative loading on this first component, consistent with the idea that somatic mutation rates in cancer cells are highest in inaccessible, heterochromatin-like regions and lowest in accessible euchromatin-like domains.
In contrast, germline SNP density and human–chimp divergence have stronger loadings on the second orthogonal principal component (Supplementary Fig. 3).
Indeed, consistent with previous findings and an important role for background selection and/or genetic hitchhiking in determining human diversity levels, germline SNP density is most positively correlated with the rate of recombination during meiosis (r = 0.45, P < 2.2 × 10^-16).
(Note: principal component analysis was used to investigate further the inter-dependencies among the chromosome features; analysis of how combinations of features relate to mutation; interpretation of the figures; support from the literature.)
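The idea of a single underlying component can be sketched with a bare-bones first principal component. This is an illustrative sketch only: it z-scores each feature, builds the correlation matrix and extracts the leading eigenvector by power iteration, which is a stand-in for whatever PCA routine the authors used. The data are invented, and the sign of the loadings is arbitrary (here it is fixed by the starting vector), so only the relative signs are meaningful.

```python
import math

def standardize(col):
    """Z-score a column using the population standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]

def first_principal_component(columns, iterations=500):
    """Return (loadings, explained_fraction) for the first PC of z-scored columns."""
    z = [standardize(c) for c in columns]
    n, p = len(z[0]), len(z)
    # p x p correlation matrix of the features.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]
    v = [1.0] + [0.0] * (p - 1)  # starting vector; fixes the sign convention
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue / p  # the trace of a correlation matrix equals p

# Toy per-window values, invented for illustration (not data from the paper).
snv = [5.0, 9.0, 2.0, 8.0, 3.0, 1.0]
h3k9me3 = [4.0, 8.0, 1.0, 9.0, 2.0, 2.0]
h3k4me3 = [3.0, 1.0, 8.0, 2.0, 7.0, 9.0]

loadings, explained = first_principal_component([snv, h3k9me3, h3k4me3])
print([round(x, 2) for x in loadings], round(explained, 2))
```

In the toy data, SNV density and H3K9me3 load with the same sign and H3K4me3 with the opposite sign, mirroring the heterochromatin versus euchromatin split described in the paragraph.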
Paragraph 5:
To confirm that our conclusions were not tumour-type-specific, we also analysed the mutations from each cancer sample in isolation.
The individual samples derived from distinct tissues and showed signatures of exposure to different environmental mutagens such as ultraviolet radiation in the melanoma and carcinogens from tobacco smoke in the lung cancer sample.
However, SNV density is positively correlated with levels of H3K9me3 and negatively correlated with many features associated with open chromatin in each of the individual tumour samples, supporting the generality of our findings (Fig. 2).
(Note: rules out the possibility that the conclusions are tumour-type-specific; description of the data analysis; results read from the figure.)
Paragraph 6:
Considering transition mutations separately from transversions, or CpG mutations separately from non-CpG mutations, also does not change our conclusions: elevated mutation rates are strongly associated with H3K9me3 (Fig. 3a) and other indicators of heterochromatin (Supplementary Fig. 11), irrespective of the mutation type.
(Note: considers the effect of other factors on the conclusions.)
Likewise, the association remains strong when considering only non-genic or only genic regions of the genome (Fig. 3b), so cannot be accounted for by transcription- or expression-coupled repair.
The correlation is also strong when only considering SNVs surrounded by unique sequence (Fig. 3c and Supplementary Fig. 9), after filtering out regions of the genome with extreme GC content (Fig. 3c), and when excluding evolutionarily conserved bases (Fig. 3c).
The association between chromatin organization and mutation-rate variation is therefore upheld for diverse tissue types, diverse mutation types and diverse genomic regions.
(Note: the correlation when other factors are taken into account. Summary: the association between chromatin organization and mutation-rate variation holds for diverse tissue types, mutation types and genomic regions.)
Paragraph 7:
Last, we examined the extent to which predictions of mutation-rate variation could be improved by combining the information from multiple genomic features.
We used an iterative procedure to identify the combination of features that provide the best predictions in multiple linear regression models using increasing numbers of features (see Methods) and found that more than 55% of the variance in cancer SNV density along the genome could be explained by combining features (Fig. 4).
In contrast, predictions of germline SNP density or human–chimp divergence never accounted for more than 35% of the variance, with recombination rate alone accounting for 20.5% of the observed variance in germline SNP density (Supplementary Fig. 4).
In the cancer cells, H3K9me3 alone can account for more than 40% of the observed variance in SNV density.
The remaining predictive features included in the models mark regions with open chromatin, or in the case of Hi-C, further distinguish the compartmental organization of the genome.
The Hi-C metric used here is a measure devised previously and uses genome-wide data on physical contacts between regions through three-dimensional folding of the chromosome.
It distinguishes between densely packed chromatin with strong short-range interactions and accessible euchromatin with a more diverse interaction pattern.
The Hi-C metric anti-correlates with somatic SNV density (r = -0.55, P < 2.2 × 10^-16), further supporting a model in which chromatin organization is a major determinant of variation in regional mutation rate.
(Note: model building: information from genomic features; prediction of mutation-rate variation.)
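One way to picture an iterative procedure over models with increasing numbers of features is greedy forward selection, sketched below. This is an assumption: the notes do not record the exact procedure from the paper's Methods, so the selection rule, the function names and the toy data are hypothetical. Ordinary least squares is solved through the normal equations, which is adequate for a handful of well-behaved features but would fail on exactly collinear ones.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def forward_selection(features, y, max_features):
    """Greedily add the feature that raises R^2 most; return [(name, R^2), ...]."""
    chosen, path = [], []
    while len(chosen) < min(max_features, len(features)):
        best = max(
            (name for name in features if name not in chosen),
            key=lambda name: r_squared([features[f] for f in chosen + [name]], y),
        )
        chosen.append(best)
        path.append((best, r_squared([features[f] for f in chosen], y)))
    return path

# Toy per-window values, invented for illustration (not data from the paper).
snv_density = [5.0, 9.0, 2.0, 8.0, 3.0, 1.0]
features = {
    "H3K9me3": [5.0, 8.0, 2.0, 9.0, 3.0, 1.0],
    "H3K4me3": [3.0, 1.0, 8.0, 2.0, 7.0, 9.0],
    "recombination": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
}
path = forward_selection(features, snv_density, max_features=3)
for name, r2 in path:
    print(name, round(r2, 3))
```

Each step reports the cumulative R^2, so the output reads like the curve in Fig. 4: the first feature explains most of the variance and later features add progressively less.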
Paragraph 8:
However, at least in cancer cells, our analyses indicate that the dominant determinant of regional mutation rate variation is chromatin organization, with mutation rates elevated in more heterochromatin-like domains and repressed in more open chromatin.
This could reflect differing accessibility to DNA repair complexes, variation in the ability to signal repair or perhaps increased exposure to mutagens at the nuclear periphery.
The somatic mutations considered here all arose in lineages that ultimately gave rise to tumours;
although the mutation process may be different in tumour lineages, the tumours analysed here derive from diverse tissue types, which suggests the intriguing possibility that chromatin organization will be a major influence on regional mutation rates in all human somatic cells.
(Conclusion: the dominant determinant of regional mutation-rate variation is chromatin organization, with mutation rates elevated in more heterochromatin-like domains and repressed in more open chromatin.)
METHODS SUMMARY
SNVs of human cancer cells were obtained from recent publications [1–5].
Germline polymorphisms were taken from dbSNP [26] and the 1000 Genomes Project [27].
Ensembl Compara provided the human–chimp divergence data [28].
Histone methylation [20] and acetylation [19] states were mapped to the genome, as well as an array of additional genomic feature sets from various sources: recombination rate [17], nucleosome positioning [15, 29], spatial proximity [16], replication timing [14], gene density and evolutionary conservation [30].
Genomes were then split into evenly sized windows, and windows with a high repeat content [18] were excluded to calculate Pearson correlations between features.
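Article details:
https://www.nature.com/articles/nature11273
Data analysis approach:
How was the analysis approach (hypothesis) arrived at and developed? By testing diverse genetic and epigenetic features, the authors found that mutation rates in cancer genomes are strikingly related to chromatin organization; they then used regression models, with a single feature or a combination of features, to explain the variance in mutation density.
Is the relationship between chromatin organization and mutation frequency causal? Regression analysis only establishes correlation, not causation. See paragraphs 2 and 3.
How can this be validated experimentally or with other bioinformatic analyses? Check the model fit with the R^2 statistic: the proportion of the variance in Y that is explained by a linear combination of the Xj.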
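For reference, the R^2 statistic can be written out explicitly. The notation is an assumption introduced here: y_i is the SNV density in window i, \hat{y}_i is the value fitted by the regression on the features X_j, and \bar{y} is the mean over windows. The second line uses the standard fact that, for a single predictor with an intercept, R^2 equals the squared Pearson correlation, which ties the reported r = 0.64 for H3K9me3 to the statement that it explains more than 40% of the variance.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

% Single predictor (H3K9me3 alone): R^2 = r^2
R^2 = r^2 = 0.64^2 \approx 0.41
```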