Abstract-1
Protein complexes are key units for studying a cell system. Over the past decades, genome-scale protein–protein interaction (PPI) data have been determined by high-throughput approaches, which enables the identification of protein complexes from PPI networks. However, high-throughput approaches often produce a considerable fraction of false positives and false negatives. In this study, we propose the mutual important interacting partner relation to reflect the co-complex relationship of two proteins based on their interaction neighborhoods. In addition, a new algorithm called idenPC-MIIP is developed to identify protein complexes from weighted PPI networks. The experimental results on two widely used datasets show that idenPC-MIIP outperforms 17 state-of-the-art methods, especially for identification of small protein complexes with only two or three proteins.
Brief: Protein complex prediction currently relies mainly on PPI networks, but these networks contain many false positives and false negatives, so performance is poor. Overall workflow: (1) build an unbiased PPI network from a yeast protein database, then score the semantic similarity between proteins with GOSemSim and use it to weight the matrix. A per-node interaction threshold decides whether two nodes are mutually important interacting partners (MIIP). (2) Cluster the MIIPs, take the node with the highest degree as the seed, and expand the cluster under a set threshold. (3) Evaluate the similarity of the protein clusters; those with similarity above 0.8 are considered likely to form a protein complex.
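The mutual-partner test in step 1 can be sketched as follows. This is only an illustration, not the idenPC-MIIP implementation: the `ratio` cutoff (a neighbor counts as important when its edge weight is at least `ratio` times the node's strongest edge) and both function names are assumptions made for the example.

```python
def important_partners(graph, node, ratio=0.5):
    """Neighbors whose edge weight is at least ratio * the node's strongest edge.

    graph: dict mapping node -> {neighbor: weight}, stored symmetrically.
    The ratio-based cutoff is an assumed stand-in for the per-node threshold.
    """
    neighbors = graph.get(node, {})
    if not neighbors:
        return set()
    cutoff = ratio * max(neighbors.values())
    return {n for n, w in neighbors.items() if w >= cutoff}


def miip_pairs(graph, ratio=0.5):
    """Pairs (u, v) where each node is an important partner of the other."""
    pairs = set()
    for u in graph:
        for v in important_partners(graph, u, ratio):
            # u < v keeps each unordered pair once.
            if u < v and u in important_partners(graph, v, ratio):
                pairs.add((u, v))
    return pairs


if __name__ == "__main__":
    toy = {
        "A": {"B": 0.9, "C": 0.2},
        "B": {"A": 0.9, "C": 0.8},
        "C": {"A": 0.2, "B": 0.8, "D": 0.3},
        "D": {"C": 0.3},
    }
    print(sorted(miip_pairs(toy)))
```

In the toy graph, D treats C as important but C does not return the favor, so the C-D edge is not a mutual pair even though it exists in the network.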
Abstract-2
Breast cancer is one of the most common human malignant diseases and the leading cause of cancer-related death in the world. However, the prognosis and therapeutic benefit of breast cancer patients cannot be predicted accurately by the current stratifying system. In this study, an immune-related prognostic score was established in 22 breast cancer cohorts with a total of 6415 samples. An extensive immunogenomic analysis was conducted to explore the relationships between immune score, prognostic significance, infiltrating immune cells, cancer genotypes and potential immune escape mechanisms. Our analysis revealed that this immune score was a promising biomarker for estimating overall survival in breast cancer. This immune score was associated with important immunophenotypic factors, such as immune escape and mutation load. Further analysis revealed that patients with high immune scores exhibited therapeutic benefits from chemotherapy and immunotherapy. Based on these results, we can conclude that this immune score may be a useful tool for overall survival prediction and treatment guidance for patients with breast cancer.
Brief: Defines breast cancer subtypes using ssGSEA and a collection of immune-related gene sets, and reveals how the immune subtypes relate to mutations and to the effectiveness of immunotherapy.
Abstract-3
Identification of new drug–target interactions (DTIs) is an important but time-consuming and costly step in drug discovery. In recent years, to mitigate these drawbacks, researchers have sought to identify DTIs using computational approaches. However, most existing methods construct drug networks and target networks separately, and then predict novel DTIs based on known associations between the drugs and targets without accounting for associations between drug–protein pairs (DPPs). To incorporate the associations between DPPs into DTI modeling, we built a DPP network based on multiple drugs and proteins in which DPPs are the nodes and the associations between DPPs are the edges of the network. We then propose a novel learning-based framework, ‘graph convolutional network (GCN)-DTI’, for DTI identification. The model first uses a graph convolutional network to learn the features for each DPP. Second, using the feature representation as an input, it uses a deep neural network to predict the final label. The results of our analysis show that the proposed framework outperforms some state-of-the-art approaches by a large margin.
Brief: (1) Using existing databases, build a drug-protein network in which each drug-gene pair is a node; edge weights are assigned as strong association, weak association or no association according to certain relations. (2) Use protein sequence and drug structure information as features. (3) Compute the features of every node in the network with graph convolution, then predict drug-protein interactions with a neural network.
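A rough sketch of steps 1 and 3, to make the "pairs as nodes" idea concrete. Two things here are assumptions rather than the paper's definitions: the edge rule (two DPPs are linked when they share a drug or a protein, with a single weight of 1.0 instead of strong/weak levels), and the propagation step, which is the standard symmetric-normalized graph convolution without learned weights or a nonlinearity.

```python
import math


def dpp_adjacency(pairs):
    """Adjacency matrix over drug-protein pairs.

    Assumed rule: two pairs are connected (weight 1.0) when they share
    the drug or the protein. The source uses strong/weak/no association.
    """
    n = len(pairs)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shares = pairs[i][0] == pairs[j][0] or pairs[i][1] == pairs[j][1]
            if i != j and shares:
                adj[i][j] = 1.0
    return adj


def gcn_propagate(adj, feats):
    """One propagation step: D^-1/2 (A + I) D^-1/2 X, with no learned weights."""
    n, dim = len(adj), len(feats[0])
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            coef = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
            for k in range(dim):
                row[k] += coef * feats[j][k]
        out.append(row)
    return out


if __name__ == "__main__":
    pairs = [("d1", "p1"), ("d1", "p2"), ("d2", "p3")]
    feats = [[1.0], [3.0], [5.0]]
    print(gcn_propagate(dpp_adjacency(pairs), feats))
```

The first two pairs share drug `d1`, so their features are averaged together, while the isolated third pair keeps its own feature. A real model would multiply by a trained weight matrix and feed the result to the downstream neural network.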
Abstract-4
The progression of cancer is accompanied by the acquisition of stemness features. Many stemness evaluation methods based on transcriptional profiles have been presented to reveal the relationship between stemness and cancer. However, instead of absolute stemness index values (values within a fixed range), these methods gave values without a range, which makes them unable to evaluate stemness intuitively. Besides, these indices were based on the absolute expression values of genes, which were found to be seriously influenced by batch effects and the composition of samples in the dataset. Recently, we showed that signatures based on the relative expression orderings (REOs) of gene pairs within a sample were highly robust against these factors, so REO-based signatures have been stably applied to evaluate continuous scores within a fixed range. Here, we provided an absolute REO-based stemness index to evaluate the stemness. We found that this stemness index had higher correlation with the culture time of the differentiated stem cells than the previous stemness index. When applied to the cancer and normal tissue samples, the stemness index showed a significant difference between cancers and normal tissues and an ability to reveal intratumor heterogeneity at the stemness level. Importantly, a higher stemness index was associated with poorer prognosis and greater oncogenic dedifferentiation reflected by histological grade. All results showed the capability of the REO-based stemness index to assist the assignment of tumor grade and its potential therapeutic and diagnostic implications.
Brief: Earlier studies fall short: stemness evaluation is not stable across datasets.
LIN28B functions effectively in reprogramming to pluripotency
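To see why an REO-based score has a fixed range and resists batch effects, here is a minimal sketch. The scoring rule (the fraction of signature gene pairs whose within-sample order matches the stemness-associated direction) is an assumption chosen for illustration; the abstract does not give the exact formula, and the function name and gene labels are hypothetical.

```python
def reo_stemness_index(expression, gene_pairs):
    """Fraction of pairs (a, b) with expression[a] > expression[b].

    expression: dict mapping gene -> value for one sample.
    gene_pairs: signature pairs, each written so that a > b indicates stemness.
    The result always lies in [0, 1].
    """
    usable = [(a, b) for a, b in gene_pairs
              if a in expression and b in expression]
    if not usable:
        raise ValueError("no signature gene pair is measured in this sample")
    votes = sum(1 for a, b in usable if expression[a] > expression[b])
    return votes / len(usable)


if __name__ == "__main__":
    sample = {"G1": 5.0, "G2": 1.0, "G3": 3.0, "G4": 9.0}
    pairs = [("G1", "G2"), ("G3", "G4"), ("G4", "G2")]
    print(reo_stemness_index(sample, pairs))
    # Shifting and rescaling every value keeps every ordering, hence the score.
    shifted = {g: 10.0 * v + 100.0 for g, v in sample.items()}
    print(reo_stemness_index(shifted, pairs))
```

Because only the orderings inside one sample are used, any monotone distortion of the measurements leaves the index unchanged, which is the robustness property the abstract relies on.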
Abstract-5
Messenger RNAs (mRNAs) carry the special responsibility of transmitting the genetic code from DNA to discrete locations in the cytoplasm. The locating process of mRNA might provide spatial and temporal regulation of mRNA and protein functions. In situ hybridization and quantitative transcriptomics analysis could provide detailed information about mRNA subcellular localization; however, they are time-consuming and expensive. It is highly desirable to develop computational tools for timely and effective prediction of mRNA subcellular location. In this work, by using binomial distribution and one-way analysis of variance, the optimal nonamer composition was obtained to represent mRNA sequences. Subsequently, a predictor based on support vector machine was developed to identify the mRNA subcellular localization. In 5-fold cross-validation, results showed that the accuracy is 90.12% for Homo sapiens (H. sapiens). The predictor may provide a reference for the study of mRNA localization mechanisms and mRNA translocation strategies. An online web server was established based on our models, which is available at http://lin-group.cn/server/iLoc-mRNA/.
Brief: Identifies mRNA subcellular localization from the mRNA sequence with an SVM.
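The "nonamer composition" in the abstract is a k-mer frequency profile with k = 9. A minimal sketch of that feature extraction follows; the function name is hypothetical, and the feature selection (binomial distribution, one-way ANOVA) and the SVM itself are not shown.

```python
from collections import Counter


def kmer_frequencies(sequence, k=9):
    """Relative frequency of every overlapping k-mer in a sequence.

    k=9 corresponds to nonamers. Returns an empty dict when the
    sequence is shorter than k.
    """
    seq = sequence.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return {kmer: count / n_windows for kmer, count in counts.items()}


if __name__ == "__main__":
    # k=3 keeps the toy output readable; the paper's features use k=9.
    print(kmer_frequencies("AUGAUG", k=3))
```

With four nucleotides there are 4^9 = 262,144 possible nonamers, which is why a selection step is needed before training a classifier on them.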
Abstract-6
Numerous studies have shown that copy number variation (CNV) in lncRNA regions plays critical roles in the initiation and progression of cancer. However, our knowledge about their functionalities is still limited. Here, we first provided a computational method to identify lncRNAs with copy number variation (lncRNAs-CNV) and their driving transcriptional perturbed subpathways by integrating multidimensional omics data of cancer. The high reliability and accuracy of our method have been demonstrated. Then, the method was applied to 14 cancer types, and a comprehensive characterization and analysis was performed. LncRNAs-CNV had high specificity in cancers, and those with high CNV level may perturb broad biological functions. Some core subpathways and cancer hallmarks widely perturbed by lncRNAs-CNV were revealed. Moreover, subpathways highlighted the functional diversity of lncRNAs-CNV in various cancers. Survival analysis indicated that functional lncRNAs-CNV could be candidate prognostic biomarkers for clinical applications, such as ST7-AS1, CDKN2B-AS1 and EGFR-AS1. In addition, cascade responses and a functional crosstalk model among lncRNAs-CNV, impacted genes, driving subpathways and cancer hallmarks were proposed for understanding the driving mechanism of lncRNAs-CNV. Finally, we developed a user-friendly web interface, LncCASE (http://bio-bigdata.hrbmu.edu.cn/LncCASE/), for exploring lncRNAs-CNV and their driving subpathways in various cancer types. Our study identified and systematically characterized lncRNAs-CNV and their driving subpathways and presented valuable resources for investigating the functionalities of non-coding variations and the mechanisms of tumorigenesis.
Brief: Characterizes lncRNAs-CNV at the pan-cancer level; lncRNAs-CNV affect key pathways in cancer.
Abstract-7
Single-cell RNA sequencing allows us to study cell heterogeneity at an unprecedented cell-level resolution and identify known and new cell populations. The current cell labeling pipeline uses unsupervised clustering and assigns labels to clusters by manual inspection. However, this pipeline does not utilize available gold-standard labels because there are usually too few of them to be useful to most computational methods. This article aims to facilitate cell labeling with a semi-supervised method in an alternative pipeline, in which a few gold-standard labels are first identified and then extended to the rest of the cells computationally. We built a semi-supervised dimensionality reduction method, a network-enhanced autoencoder (netAE). Tested on three public datasets, netAE outperforms various dimensionality reduction baselines and achieves satisfactory classification accuracy even when the labeled set is very small, without disrupting the similarity structure of the original space.
Brief: Two annotation approaches exist: manual labeling by markers after clustering, and annotation after training on a gold-standard dataset. The paper develops a semi-supervised dimensionality reduction method based on an autoencoder. Its advantage is that the reduced space keeps as much information as possible when samples are compressed to low dimension, while remaining useful: it shows a strong cluster structure and is easy to classify.
Abstract-8
Evidence has shown that microRNAs, one type of small biomolecule, regulate the expression level of genes and play an important role in the development or treatment of diseases. Drugs, as important chemical compounds, can interact with microRNAs and change their functions. The experimental identification of microRNA–drug interactions is time-consuming and expensive. Therefore, it is appealing to develop effective computational approaches for predicting microRNA–drug interactions. In this study, a matrix factorization-based method, called the microRNA–drug interaction prediction approach (MDIPA), is proposed for predicting unknown interactions among microRNAs and drugs. Specifically, MDIPA utilizes experimentally validated interactions between drugs and microRNAs, drug similarity and microRNA similarity to predict undiscovered interactions. A path-based microRNA similarity matrix is constructed, while the structural information of drugs is used to establish a drug similarity matrix. To evaluate its performance, our MDIPA is compared with four state-of-the-art prediction methods with an independent dataset and cross-validation. The results of both evaluation methods confirm the superior performance of MDIPA over other methods. Finally, the results of molecular docking in a case study with breast cancer confirm the efficacy of our approach. In conclusion, MDIPA can be effective in predicting potential microRNA–drug interactions.
Brief: First build a matrix from the known miRNAs and drugs, with known interactions set to 1 and unknown ones set to 0. Following the formula defined in the paper, fill in the matrix using the miRNA and drug similarity matrices, then factorize it with NMF and multiply the factors back together to obtain the converged miRNA-drug association matrix.
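The factorize-then-multiply step can be sketched with plain NMF. This assumes the classic Lee-Seung multiplicative updates for the squared error; MDIPA's own objective, regularization and the similarity-based filling formula are not reproduced here, and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Rows: miRNAs, columns: drugs; 1.0 marks a known interaction.
    known = [[1.0, 1.0, 0.0, 0.0],
             [1.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 1.0],
             [0.0, 0.0, 0.0, 1.0]]
    w, h = nmf(known, rank=2)
    for row in matmul(w, h):
        print([round(x, 2) for x in row])
```

Entries of the product `WH` that were 0 in the input but come back clearly positive are the candidate new interactions; how they are thresholded or ranked is left open here.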
Abstract-9
Gene Set Enrichment Analysis (GSEA) is an algorithm widely used to identify statistically enriched gene sets in transcriptomic data. However, GSEA cannot examine the enrichment of two gene sets or pathways relative to one another. Here we present Differential Gene Set Enrichment Analysis (DGSEA), an adaptation of GSEA that quantifies the relative enrichment of two gene sets. After validating the method using synthetic data, we demonstrate that DGSEA accurately captures the hypoxia-induced coordinated upregulation of glycolysis and downregulation of oxidative phosphorylation. We also show that DGSEA is more predictive than GSEA of the metabolic state of cancer cell lines, including lactate secretion and intracellular concentrations of lactate and AMP. Finally, we demonstrate the application of DGSEA to generate hypotheses about differential metabolic pathway activity in cellular senescence. Together, these data demonstrate that DGSEA is a novel tool to examine the relative enrichment of gene sets in transcriptomic data.
Brief: DGSEA adapts the GSEA algorithm into one that tests the difference between two gene sets.
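A simplified sketch of the idea: compute a GSEA-style running-sum enrichment score for each gene set over the same ranked list, then take the difference. This uses the unweighted (Kolmogorov-Smirnov-like) score and assumes the differential statistic is a plain difference of the two scores; the weighting by ranking metric and the permutation-based normalization and significance testing used in practice are omitted.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list.

    Walks down the list, stepping up at genes in the set and down otherwise,
    and returns the signed deviation from zero with the largest magnitude.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def differential_enrichment(ranked_genes, set_a, set_b):
    """Difference of the two enrichment scores (assumed form of the statistic)."""
    return (enrichment_score(ranked_genes, set_a)
            - enrichment_score(ranked_genes, set_b))


if __name__ == "__main__":
    ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # most up-regulated first
    print(differential_enrichment(ranked, {"g1", "g2"}, {"g5", "g6"}))
```

In the toy ranking, one set sits at the top and the other at the bottom, so their scores have opposite signs and the difference is larger than either score alone, which is the coordinated up/down pattern the abstract describes for glycolysis and oxidative phosphorylation.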
Abstract-10
Accurately predicting the risk of cancer patients is a central challenge for clinical cancer research. For high-dimensional gene expression data, Cox proportional hazard model with the least absolute shrinkage and selection operator for variable selection (Lasso-Cox) is one of the most popular feature selection and risk prediction algorithms. However, the Lasso-Cox model treats all genes equally, ignoring the biological characteristics of the genes themselves. This often leads to poor prognostic performance on independent datasets. Here, we propose a Reweighted Lasso-Cox (RLasso-Cox) model to ameliorate this problem by integrating gene interaction information. It is based on the hypothesis that topologically important genes in the gene interaction network tend to have stable expression changes. We used random walk to evaluate the topological weight of genes, and then highlighted topologically important genes to improve the generalization ability of the RLasso-Cox model. Experiments on datasets of three cancer types showed that the RLasso-Cox model improves the prognostic accuracy and robustness compared with the Lasso-Cox model and several existing network-based methods. More importantly, the RLasso-Cox model has the advantage of identifying small gene sets with high prognostic performance on independent datasets, which may play an important role in identifying robust survival biomarkers for various cancer types.
Brief: Builds on the hypothesis that genes with greater topological importance in the gene interaction network have stable expression changes. A random walk over a known gene interaction network evaluates each gene's topological weight, and these weights are introduced into Cox regression to improve generalization.
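One way to picture the reweighting: score genes with a random walk, then shrink the L1 penalty for high-scoring genes. The sketch below uses a PageRank-style walk with uniform teleportation and turns scores into per-gene penalty factors by `max_score / score`. Both choices are assumptions for illustration; the abstract only says a random walk is used and that topologically important genes are highlighted.

```python
def random_walk_scores(network, damping=0.85, iters=100):
    """Stationary visiting probability of a random walk with teleportation.

    network: dict mapping gene -> list of neighbor genes (undirected, so
    every neighbor must also appear as a key).
    """
    nodes = sorted(network)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            neighbors = network[n]
            if neighbors:
                share = damping * score[n] / len(neighbors)
                for m in neighbors:
                    new[m] += share
            else:
                # Isolated gene: spread its mass evenly over all genes.
                for m in nodes:
                    new[m] += damping * score[n] / len(nodes)
        score = new
    return score


def penalty_factors(scores):
    """Per-gene L1 penalty multipliers: 1.0 for the top gene, larger for others."""
    top = max(scores.values())
    return {gene: top / s for gene, s in scores.items()}


if __name__ == "__main__":
    star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
    scores = random_walk_scores(star)
    print(scores)
    print(penalty_factors(scores))
```

The resulting factors could be handed to any Lasso-Cox solver that accepts per-feature penalty weights, so peripheral genes need stronger evidence than hub genes to enter the model.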
Abstract-11
Population studies such as genome-wide association study have identified a variety of genomic variants associated with human diseases. To further understand potential mechanisms of disease variants, recent statistical methods associate functional omic data (e.g. gene expression) with genotype and phenotype and link variants to individual genes. However, how to interpret molecular mechanisms from such associations, especially across omics, is still challenging. To address this problem, we developed an interpretable deep learning method, Varmole, to simultaneously reveal genomic functions and mechanisms while predicting phenotype from genotype. In particular, Varmole embeds multi-omic networks into a deep neural network architecture and prioritizes variants, genes and regulatory linkages via biological drop-connect without needing prior feature selections.
Brief: Proposes a learning algorithm that predicts disease phenotype from genotype and gene expression data at the population level.
Abstract-12
We developed an interpretable and scalable machine learning model, ECMarker, to predict gene expression biomarkers for disease phenotypes and simultaneously reveal underlying regulatory mechanisms. Particularly, ECMarker is built on the integration of semi- and discriminative-restricted Boltzmann machines, a neural network model for classification allowing lateral connections at the input gene layer. This interpretable model is scalable without needing any prior feature selection and enables directly modeling and prioritizing genes and revealing potential gene networks (from lateral connections) for the phenotypes. With application to the gene expression data of non-small-cell lung cancer patients, we found that ECMarker not only achieved a relatively high accuracy for predicting cancer stages but also identified the biomarker genes and gene networks implying the regulatory mechanisms in the lung cancer development. In addition, ECMarker demonstrates clinical interpretability as its prioritized biomarker genes can predict survival rates of early lung cancer patients (P-value < 0.005). Finally, we identified a number of drugs currently in clinical use for late stages or other cancers with effects on these early lung cancer biomarkers, suggesting potential novel candidates for early cancer medicine.
Brief: Three main parts: (1) identify disease phenotypes at the population level with semi-restricted and discriminative Boltzmann machines; (2) rank the genes that contribute to each phenotype by their connectivity in the neural network and identify the related gene regulatory networks; (3) survival and functional analysis of the relevant genes.
Abstract-13
CircRNAs are an abundant class of non-coding RNAs with widespread, cell-/tissue-specific patterns. Previous work suggested that epigenetic features might be related to circRNA expression. However, the contribution of epigenetic changes to circRNA expression has not been investigated systematically. Here, we built a machine learning framework named CIRCScan, to predict circRNA expression in various cell lines based on the sequence and epigenetic features. The prediction accuracy of the expression status models was high with area under the curve of receiver operating characteristic (ROC) values of 0.89–0.92 and the false-positive rates of 0.17–0.25. Predicted expressed circRNAs were further validated by RNA-seq data. The performance of expression-level prediction models was also good with normalized root-mean-square errors of 0.28–0.30 and Pearson’s correlation coefficient r over 0.4 in all cell lines, along with Spearman's correlation coefficient ρ of 0.33–0.46. Notably, H3K79me2 was highly ranked in modeling both circRNA expression status and levels across different cells. Further analysis in nine additional cell lines demonstrated a significant enrichment of H3K79me2 in circRNA flanking intron regions, supporting the potential involvement of H3K79me2 in circRNA expression regulation.
Brief: Predicts circRNA expression in different cell lines from sequence and epigenetic features.
Abstract-14
With the reduction in price of next-generation sequencing technologies, gene expression profiling using RNA-seq has increased the scope of sequencing experiments to include more complex designs, such as designs involving repeated measures. In such designs, RNA samples are extracted from each experimental unit at multiple time points. The read counts that result from RNA sequencing of the samples extracted from the same experimental unit tend to be temporally correlated. Although there are many methods for RNA-seq differential expression analysis, existing methods do not properly account for within-unit correlations that arise in repeated-measures designs. We address this shortcoming by using normalized log-transformed counts and associated precision weights in a general linear model pipeline with continuous autoregressive structure to account for the correlation among observations within each experimental unit. We then utilize parametric bootstrap to conduct differential expression inference. Simulation studies show the advantages of our method over alternatives that do not account for the correlation among observations within experimental units.
Brief: In time-series RNA sequencing, the RNA-seq measurements tend to be temporally correlated, but existing differential expression methods cannot account for the within-unit correlation of repeated-measures designs.
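The "continuous autoregressive structure" mentioned in the abstract is commonly written as a first-order form in which the correlation between two observations on the same unit decays with their time gap: corr(t_i, t_j) = rho^|t_i - t_j|. A minimal sketch of that correlation matrix follows; the function name is hypothetical, and estimating `rho` and fitting the linear model are not shown.

```python
def car1_correlation(times, rho):
    """Within-unit correlation matrix for a continuous AR(1) structure.

    times: observation times for one experimental unit (need not be
    equally spaced). rho: correlation between observations one time
    unit apart, with 0 <= rho < 1.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must satisfy 0 <= rho < 1")
    return [[rho ** abs(ti - tj) for tj in times] for ti in times]


if __name__ == "__main__":
    for row in car1_correlation([0, 1, 3], rho=0.5):
        print(row)
```

Unlike a discrete AR(1) structure, this form handles unequally spaced time points directly: the samples at times 1 and 3 are two units apart, so their correlation is rho squared.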
Abstract-15
We describe a new iteration of ICGS that outperforms state-of-the-art scRNA-Seq detection workflows when applied to well-established benchmarks. This approach combines multiple complementary subtype detection methods (HOPACH, sparse non-negative matrix factorization, cluster ‘fitness’, support vector machine) to resolve rare and common cell-states, while minimizing differences due to donor or batch effects. Using data from multiple cell atlases, we show that the PageRank algorithm effectively downsamples ultra-large scRNA-Seq datasets, without losing extremely rare or transcriptionally similar yet distinct cell types and while recovering novel transcriptionally distinct cell populations. We believe this new approach holds tremendous promise in reproducibly resolving hidden cell populations in complex datasets.
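Brief: Workflow: (1) if the dataset has more than 2500 cells, downsample it with PageRank/Louvain-based downsampling, depending on dataset size; (2) identify gene modules by computing correlation coefficients among highly variable genes, excluding cell-cycle genes before the calculation; (3) reduce the dimensionality of the data with NMF, where K is set by the number of significantly differential genes; (4) for each NMF-defined gene cluster, correlate every gene with the cluster metadata; the top 60 genes per group with a correlation coefficient above 0.3 are taken as that group's marker genes; (5) train an SVM on the input cells and the selected marker genes, then classify all cells with this model.
Brief: A detection pipeline for ncRNA.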
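Step 4 of the workflow can be sketched as a correlation filter. The interpretation here is an assumption: each gene is correlated (Pearson) with a 0/1 membership indicator for the cluster, genes must exceed `min_r = 0.3`, and at most `top_n = 60` are kept. The function names are hypothetical and this is not the ICGS code.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def marker_genes(expression, labels, cluster, top_n=60, min_r=0.3):
    """Genes most correlated with membership in one cluster.

    expression: dict mapping gene -> list of values, one per cell.
    labels: cluster label of each cell, in the same cell order.
    """
    indicator = [1.0 if label == cluster else 0.0 for label in labels]
    scored = [(pearson(values, indicator), gene)
              for gene, values in expression.items()]
    kept = sorted((item for item in scored if item[0] > min_r), reverse=True)
    return [gene for _, gene in kept[:top_n]]


if __name__ == "__main__":
    labels = ["c1", "c1", "c2", "c2"]
    expression = {
        "up": [5.0, 6.0, 1.0, 0.0],
        "flat": [2.0, 2.0, 2.0, 2.0],
        "down": [0.0, 1.0, 7.0, 8.0],
    }
    print(marker_genes(expression, labels, "c1"))
    print(marker_genes(expression, labels, "c2"))
```

The selected genes would then form the feature set for the SVM in step 5.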
Abstract-18
Although there has been great progress in cancer treatment, cancer remains a serious health threat to humans because of the lack of biomarkers for diagnosis, especially for early-stage diagnosis. In this study, we comprehensively surveyed the specifically expressed genes (SEGs) using the SEGtool based on the big data of gene expression from The Cancer Genome Atlas (TCGA) and the Genotype–Tissue Expression (GTEx) projects. In 15 solid tumors, we identified 233 cancer-specific SEGs (cSEGs), which were specifically expressed in only one cancer and showed great potential to be diagnostic biomarkers. Among them, three cSEGs (OGDH, MUDENG and ACO2) had a sample frequency >80% in kidney cancer, suggesting their high sensitivity. Furthermore, we identified 254 cSEGs as early-stage diagnostic biomarkers across 17 cancers. A two-gene combination strategy was applied to improve the sensitivity of diagnostic biomarkers, and hundreds of two-gene combinations were identified with high frequency. We also observed that 13 SEGs were targets of various drugs and nearly half of these drugs may be repurposed to treat cancers with SEGs as their targets. Several SEGs were regulated by specific transcription factors in the corresponding cancer, and 39 cSEGs were prognosis-related genes in 7 cancers. This work provides a survey of cancer biomarkers for diagnosis and early diagnosis and new insights to drug repurposing. These biomarkers may have great potential in cancer research and application.
Brief: Specifically expressed genes (SEGs) are genes expressed in only a few tissues; being tissue-specific, they may serve as markers for cancer diagnosis. The study surveys cancer-related specifically expressed genes across cancer types and finds some early-diagnosis genes (i.e. highly expressed at stage T1). Abnormal expression of pairwise combinations of cancer-related specific genes has stronger diagnostic power.