Deep learning using tumor HLA peptide mass spectrometry datasets improves neoantigen identification
Authors and affiliations:
Brendan Bulik-Sullivan, Jennifer Busby, […], Roman Yelensky
- Gritstone Oncology, Inc., Emeryville, California and Cambridge, Massachusetts, USA.
Journal and publication date:
Nature Biotechnology, published 17 December 2018
Abstract
==Neoantigens==, which are expressed on tumor cells, are one of the main targets of an effective antitumor T-cell response. Cancer immunotherapies to target neoantigens are of growing interest and are in ==early human trials==, but methods to identify neoantigens either require invasive or difficult-to-obtain clinical specimens, require the screening of hundreds to thousands of synthetic peptides or tandem minigenes, or are only relevant to specific human leukocyte antigen (HLA) alleles. We apply deep learning to a large (N = 74 patients) HLA peptide and genomic dataset from various human tumors to create a computational model of antigen presentation for neoantigen prediction. We show that our model, named EDGE, increases the positive predictive value of HLA antigen prediction by up to ninefold. We apply EDGE to enable identification of neoantigens and ==neoantigen-reactive T cells== using routine clinical specimens and small numbers of synthetic peptides for most common HLA alleles. EDGE could enable an improved ability to develop neoantigen-targeted immunotherapies for cancer patients.
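To make the "positive predictive value" (PPV) metric mentioned in the abstract concrete, the following minimal sketch computes PPV among the top-ranked peptides of a predictor. The function name, the top-k cutoff, and the toy scores are illustrative assumptions, not the evaluation protocol used for EDGE.

```python
def ppv_at_top_k(scores, presented, k):
    """Fraction of the k highest-scoring peptides that are truly presented.

    scores    -- predicted presentation scores, one per candidate peptide
    presented -- booleans, True if the peptide was actually observed
    k         -- number of top-ranked peptides to evaluate (assumed cutoff)
    """
    if len(scores) != len(presented):
        raise ValueError("scores and presented must have the same length")
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of peptides")
    # Rank peptide indices from highest to lowest predicted score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = sum(1 for i in ranked[:k] if presented[i])
    return true_positives / k


# Hypothetical toy data: five candidate peptides, two truly presented.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_presented = [True, False, True, False, False]
toy_ppv = ppv_at_top_k(toy_scores, toy_presented, k=2)
```

Under this definition, a ninefold PPV improvement would mean that nine times as many of the selected peptides are true positives at the same cutoff.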
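==Neoantigen==: an abnormal protein that arises from gene mutations in cancer cells, is absent from normal cells, and can be recognized by immune cells and attacked by the immune system. This is the neoantigen discussed here.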
Selected figures:
Figure 1 : Tissue samples and data for model training.
Tissues of origin and numbers of tumor and normal samples used (image credit: Andrii Bezvershenko/Bigstock.com), with the key generated data types: HLA peptide sequences, HLA types and tissue transcriptome measurements. Frozen tissue specimens were pulverized and lysed, with lysis product subjected to HLA immunoprecipitation with antibody W6/32 and peptide sequencing, along with mRNA extraction and transcriptome sequencing. We obtained HLA types from exome or targeted sequencing of matched normal tissues and used the integrated dataset to train a deep learning model to predict HLA epitope presentation. MS, mass spectrometry; ==NGS, next-generation sequencing==; Comet, an open source MS/MS sequence database search tool; Percolator, semisupervised learning for peptide identification from shotgun proteomics datasets; OptiType, HLA typing from NGS data; STAR-RSEM, ==bioinformatics pipeline== for estimating gene expression levels from RNA-seq data; MiSeq, NGS platform for ==targeted sequencing==; HiSeq, NGS platform for high-throughput sequencing.
Figure 2: Overview of the tumor peptidomics dataset.
(a) Peptide counts at various q-value thresholds from the 74 tumor mass spectrometry peptidomics samples. (b) Length distribution of presented peptides (FDR < 0.1). (c) Proportion of presented peptides with predicted binding affinities below various thresholds from 1 to 1,000 nM. Each blue line represents one of the 13 training set samples for which MHCflurry 1.2.0 binding affinity predictions were available for all HLA class I alleles; the red line shows the median across 13 samples. The dashed vertical lines show the common 500 nM and 50 nM 'binder' and 'strong binder' thresholds, respectively. (d) Relationship between peptide presentation and the RNA expression of the source gene measured in TPM for peptide lengths between 8 and 11. For each peptide length k, all possible peptides from all genes were grouped into 20 bins by TPM of the source gene and the proportion of k-mer peptides in each TPM bin that were detected via mass spectrometry is shown. (e) Genes with average expression across the 69 training set samples within narrow, 0.1 log10(TPM)-wide windows around 5, 10 and 100 TPM were selected. For each gene, the average prevalence of presentation of peptides of lengths 8–11 (i.e., the proportion of all possible peptides from that gene detected by mass spectrometry) from that gene across all training samples was computed. A histogram of this per-gene prevalence for all genes within each window is shown.
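The analysis in panel (d) of Figure 2 can be illustrated with a minimal sketch: candidate peptides are grouped into bins by the TPM of their source gene, and the proportion detected by mass spectrometry is reported per bin. The caption does not say how the 20 bins were defined, so the equal-width bins on log10(TPM + 1) used here, the function name, and the toy values are all assumptions.

```python
import math


def presentation_rate_by_tpm_bin(tpm, detected, n_bins=20):
    """Return, per expression bin, the proportion of peptides detected.

    tpm      -- source-gene expression (TPM) for each candidate peptide
    detected -- booleans, True if the peptide was seen by mass spectrometry
    n_bins   -- number of equal-width bins on log10(TPM + 1) (assumed scheme)
    Empty bins are reported as None.
    """
    if len(tpm) != len(detected) or not tpm:
        raise ValueError("tpm and detected must be non-empty and equal length")
    # A pseudocount of 1 keeps TPM = 0 finite after the log transform.
    log_tpm = [math.log10(t + 1.0) for t in tpm]
    lowest, highest = min(log_tpm), max(log_tpm)
    width = (highest - lowest) / n_bins
    if width == 0:
        width = 1.0  # all peptides share one expression value
    totals = [0] * n_bins
    hits = [0] * n_bins
    for value, was_detected in zip(log_tpm, detected):
        # Clamp so the maximum value falls in the last bin.
        index = min(int((value - lowest) / width), n_bins - 1)
        totals[index] += 1
        hits[index] += int(was_detected)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical toy data: low-expression peptides are rarely detected.
toy_tpm = [0.0, 0.0, 9.0, 999.0, 999.0]
toy_detected = [False, False, True, True, False]
toy_rates = presentation_rate_by_tpm_bin(toy_tpm, toy_detected, n_bins=2)
```

Running the same function separately for each peptide length k between 8 and 11 would give one curve per length, as in the panel.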
Figure 3: Architecture and features of the model.
(a) The architecture of our neural network (NN), with the subcomponents of the network active in a single patient with six HLA alleles. Pr, probability. (b) The learned dependence of HLA presentation on each sequence position for peptides of lengths 8–11 for two common HLA alleles. See Supplementary Figure 3a, b, c for learned motifs for all alleles. (c) Observed (dark blue) values are the proportion of all detected peptides in the test samples found at each peptide length. Predicted (light blue) values are the sum of probabilities of all proteome peptides of length k over the total sum of probabilities of all peptides of lengths 8–11 (i.e., the expected proportion of presented peptides of each length). (d) Observed (dark blue) values are the proportion of all detected peptides in the test samples found from genes at each mRNA expression TPM level. Predicted (light blue) values are the sum of probabilities assigned to all proteome peptides at the TPM level over the total sum of probabilities of all peptides. (e) Test set prevalence of detected peptides binned by learned per-gene propensity of presentation (x-axis) and RNA expression (y-axis) of the source genes.
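The "predicted" values in panel (c) of Figure 3 are a ratio of summed probabilities, which the following minimal sketch reproduces on toy input. The function name, the dictionary input format, and the example probabilities are assumptions made for illustration. The model's actual outputs are not available here.

```python
from collections import defaultdict


def expected_length_proportions(peptide_probs, lengths=(8, 9, 10, 11)):
    """Expected share of presented peptides at each length.

    peptide_probs -- mapping from peptide sequence to its predicted
                     presentation probability (hypothetical model output)
    lengths       -- peptide lengths included in the normalization
    """
    sums = defaultdict(float)
    for peptide, probability in peptide_probs.items():
        if len(peptide) in lengths:
            sums[len(peptide)] += probability
    total = sum(sums.values())
    if total == 0:
        raise ValueError("no probability mass at the requested lengths")
    # Sum of probabilities at length k over the sum across all lengths.
    return {k: sums[k] / total for k in lengths}


# Hypothetical toy probabilities for four peptides.
toy_probs = {
    "AAAAAAAA": 0.1,    # length 8
    "AAAAAAAAA": 0.6,   # length 9
    "CCCCCCCCC": 0.2,   # length 9
    "AAAAAAAAAA": 0.1,  # length 10
}
toy_proportions = expected_length_proportions(toy_probs)
```

The same ratio, with peptides grouped by source-gene TPM level instead of length, gives the predicted values described for panel (d).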
Figure 5: Identification of neoantigen-reactive T cells from patients with non-small-cell lung cancer.
(a) Detection of T-cell responses to neoantigen peptide pools. In vitro–expanded patient peripheral blood mononuclear cells (PBMCs) were stimulated with controls or patient-specific neoantigen peptide pools in IFN-γ ELISpot. Data are presented as spot-forming units (SFU) per 10^5 plated cells with background (corresponding DMSO controls) subtracted. Background measurements are shown in Supplementary Figure 11. For each patient, predicted neoantigens were combined into two pools of ten peptides each according to model ranking and any sequence homologies (homologous peptides were separated into different pools). ==In vitro== T cell responses in single wells (1-038-001, CU02, CU03 and 1-050-001) or duplicates (all other samples) against cognate peptide pools 1 and 2 are shown for patients 1-038-001, 1-050-001, 1-001-002, CU04, 1-024-001, 1-024-002 and CU05. For patients CU02 and CU03, cell numbers allowed testing against specific peptide pool 1 only. Patients with in vitro T cell responses are represented in shades of blue; those without are represented in shades of orange and red. (b) Detection of T-cell responses to individual neoantigen peptides. In vitro–expanded patient PBMCs were stimulated in IFN-γ ELISpot with controls or patient-specific individual neoantigen peptides. In vitro T cell responses against cognate peptides are shown for patients whose cells showed positive responses against peptide pools (shown in shades of blue in a), along with, where cell numbers permitted, ==deconvolution== to individual peptides. Patients 1-038-001 and 1-024-001: data are presented as spot-forming units (SFU) per 10^5 plated cells for one visit, with background (corresponding DMSO controls) subtracted. Patients 1-024-002 and CU04: data are shown as cumulative (added) SFU for responses from three visits (CU04) or two visits (1-024-002). See also Supplementary Figure 10b. (c) Representative example of ELISpot wells from patient CU04 from data shown in a and b. Data were confirmed in an independent culture repeat (Supplementary Fig. 10c).
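The ELISpot readout in Figure 5 is a normalized, background-subtracted count, and for some patients it is summed across visits. The sketch below shows that arithmetic under stated assumptions: the function names and toy numbers are hypothetical, and the DMSO control well is assumed to contain the same number of plated cells as the stimulated well.

```python
def sfu_per_1e5(spot_count, cells_plated, background_spots=0.0):
    """Background-subtracted spot-forming units per 1e5 plated cells.

    spot_count       -- spots counted in the peptide-stimulated well
    cells_plated     -- number of cells plated in that well
    background_spots -- spots in the matching DMSO control well, assumed
                        to hold the same number of cells
    """
    if cells_plated <= 0:
        raise ValueError("cells_plated must be positive")
    # Subtract background first, then rescale to a 1e5-cell basis.
    return (spot_count - background_spots) * (1e5 / cells_plated)


def cumulative_sfu(visits):
    """Sum background-subtracted SFU over several visits.

    visits -- iterable of (spot_count, cells_plated, background_spots)
    """
    return sum(sfu_per_1e5(s, c, b) for s, c, b in visits)


# Hypothetical toy wells: one visit, then two visits added together.
single_visit = sfu_per_1e5(60, 2e5, background_spots=10)
two_visits = cumulative_sfu([(60, 2e5, 10), (30, 1e5, 5)])
```

Negative values are not clamped here. Whether to floor them at zero is a reporting choice the caption does not specify.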
Discussion:
With the progress of cancer immunotherapy, identification of neoantigens and neoantigen-recognizing T cells has become a central challenge in assessing tumor responses [2,33], examining tumor evolution [34] and designing the next generation of personalized therapies [3]. Current neoantigen identification techniques are either time-consuming and laborious [7,20] or insufficiently precise [10,14,15,16]. Although it has recently been demonstrated that neoantigen-recognizing T cells are a major component of tumor-infiltrating lymphocytes (TILs) [7,20,35] and circulate in the peripheral blood of cancer patients [7], current methods for identifying neoantigen-reactive T cells have some combination of the following three limitations: they rely on difficult-to-obtain clinical specimens such as TILs [20,21] or leukaphereses [7], they require screening impractically large libraries of peptides [19], or they rely on MHC multimers, which may practically be available for only a small number of MHC alleles.
Here we demonstrate that all of these challenges can be addressed by improving the specificity of HLA epitope prediction algorithms by training models on mass spectrometry data instead of in vitro HLA–peptide binding affinity data. We generated the largest dataset of tumor HLA peptides, and HLA types reported to date, to our knowledge, and used these data to train a deep learning model of HLA peptide presentation. Using held-out mass spectrometry test data and retrospective neoantigen immune monitoring data, we demonstrate that our model, EDGE, outperforms state-of-the-art predictors trained on binding affinity and early predictors based on mass spectrometry peptide data by up to an order of magnitude, and show that the full scope of the predictive improvement is only achievable with the combination of several key modeling techniques. Finally, we show that by prioritizing peptides with prediction, it is possible to reliably identify neoantigen-specific T cells using a clinically practical process that requires only limited volumes of patient peripheral blood, screening a few peptides per patient, and does not rely on MHC multimers.
Critically, this improved performance for neoantigen identification is achieved by training a predictor based on data acquired by standard data-dependent acquisition mass spectrometry, despite this technique having insufficient sensitivity to detect all neoantigens directly [14]. Although targeted mass spectrometry approaches [36] may ultimately improve sensitivity for direct neoantigen identification when sufficient tissue is available, our results highlight the synergy possible between analytical and computational tools.
A clear limitation of our work is that it does not currently address prediction of HLA class II binding epitopes presented to CD4+ cells, which provide help for CD8+ T cell responses and can also exert antitumor activity directly. We chose to focus our study first on class I CD8+ T-cell epitopes because class I HLA expression is substantially more abundant than class II expression on solid tumor cells [37,38] and class I–only antigen targeted immunotherapy has been shown to result in durable solid tumor regression with adoptive cell therapy [39]. Furthermore, CD8+ T cell responses to neoantigens have now been linked to clinical efficacy of immune checkpoint inhibition [1,2]. Nevertheless, we expect that our modeling approach is applicable to the prediction of class II epitopes. To demonstrate this possibility, we trained and successfully tested our prediction model using a (currently limited) publicly available class II HLA peptide dataset (Supplementary Fig. 12).
Another limitation of our modeling approach is that it does not incorporate TCR binding or the availability of T-cell precursors. For example, it has been proposed that some peptide sequences may have biophysical properties that hinder TCR recognition in general [40], or that some neoantigens are more self-similar than others, such that the T-cell clones that recognize them are deleted by central tolerance [41]. Addressing TCR binding is an important direction for future research in neoepitope prediction; however, the predictive performance of our model on the TIL neoepitope dataset and the prospective neoantigen-reactive T cell identification task demonstrates that although modeling TCR binding may provide additional benefit [42,43], it is now possible to obtain therapeutically useful neoepitope predictions by modeling only HLA processing and presentation. In summary, this work offers practical neoantigen identification from routine patient samples and should be useful for the design and evaluation of future cancer immunotherapies.
Translation team:
叶名琛、陈凯星、王俊豪、邓峻玮、倪昊辰、常彦琪、黄敬潼、李碧琪、黄子亮、陈志荣、郑凌伶