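Many of you have probably analyzed 10X single-cell data by now, and SCENIC has become a common add-on analysis that plenty of people have already run. Today we revisit it to learn something new: we dig into the principles and methods behind SCENIC so that we can understand its results from the ground up.
SCENIC was published in Nature Methods in October 2017, which is quite a while ago; as I recall, 10X single-cell sequencing was only just taking off in 2017. The paper is "SCENIC: single-cell regulatory network inference and clustering". Let's work through what this tool actually does.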
1. Abstract
We present SCENIC, a computational method for simultaneous gene regulatory network reconstruction and cell-state identification from single-cell RNA-seq data (http://scenic.aertslab.org). On a compendium of single-cell data from tumors and brain, we demonstrate that cis-regulatory analysis can be exploited to guide the identification of transcription factors and cell states. SCENIC provides critical biological insights into the mechanisms driving cellular heterogeneity.
Two questions to keep in mind: what does "cell states" mean here, and how are the transcription factors identified? With those in mind, let's read on.
2. Introduction
The transcriptional state of a cell emerges from an underlying gene regulatory network (GRN) in which a limited number of transcription factors (TFs) and cofactors regulate each other and their downstream target genes. Put plainly, the activity of the regulatory elements determines the transcriptional state of the cell. Recent advances in single-cell transcriptome profiling have provided exciting opportunities for high-resolution identification of transcriptional states and of transitions between states; the paper then gives examples such as cell differentiation.
Statistical techniques and bioinformatics methods that are optimized for single-cell RNA-seq have led to new biological insights, but it is still unclear whether specific and robust GRNs underlying stable cell states can be determined. (That is the gap: the regulatory networks behind stable cell states are still not well characterized.) This may indeed be challenging given that at the single-cell level, gene expression may be partially disconnected from the dynamics of TF inputs on account of stochastic variation of gene expression from transcriptional bursting and other sources.
A few methods have been developed that infer coexpression networks from single-cell RNA-seq data, but these methods do not use regulatory sequence analysis to predict interactions between TFs and target genes. (In other words, they do not use sequence evidence to predict which genes a TF acts on.)
We reasoned that linking cis-regulatory sequences to single-cell gene expression could overcome dropouts and technical variation and thus optimize the discovery and characterization of cell states. (Judging from where things stand today, the tool was overly optimistic on this point.) To this end, we developed single-cell regulatory network inference and clustering (SCENIC) to map GRNs and then identify stable cell states by evaluating the activity of the GRNs in each cell. The SCENIC workflow consists of three steps.
Steps
In the first step, sets of genes that are coexpressed with TFs are identified using GENIE3. Since the GENIE3 modules are only based on coexpression, they may include many false positives and indirect targets. That much is easy to follow. One thing to note, though: the TFs are known in advance, and as far as I understand, the genes a TF acts on can in principle be derived directly, so I have some doubts here. Let's see what the rest of the analysis says.
Second, to identify putative direct-binding targets, each coexpression module is subjected to cis-regulatory motif analysis using RcisTarget. Only modules with significant motif enrichment of the correct upstream regulator are retained, and they are pruned to remove indirect targets lacking motif support. We refer to these processed modules as regulons.
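To make the "retain and prune" logic of this second step concrete, here is a minimal Python sketch. It is not RcisTarget: as I understand it, RcisTarget scores motif enrichment from precomputed genome-wide gene rankings, whereas this toy swaps in a plain hypergeometric over-representation test. The function names, the gene names, and the 0.05 cutoff are all assumptions made up for illustration.

```python
from math import comb


def hypergeom_pvalue(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the TF's motif."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def prune_module(module_genes, motif_genes, n_background, alpha=0.05):
    """Return the regulon (motif-supported targets), or None if the module is dropped.

    module_genes: genes coexpressed with the TF (output of step one)
    motif_genes:  genes with the TF's motif nearby (hypothetical annotation)
    """
    module = set(module_genes)
    motif = set(motif_genes)
    supported = module & motif
    p = hypergeom_pvalue(len(supported), len(motif), len(module), n_background)
    if p >= alpha:
        return None  # no significant motif enrichment: discard the whole module
    return sorted(supported)  # keep only targets with motif support


# Toy data: 1,000 background genes, 20 of which carry the motif.
motif_genes = [f"g{i}" for i in range(20)]
module_a = ["g0", "g1", "g2", "g3", "g4", "g5", "g500", "g501", "g502", "g503"]
module_b = [f"g{i}" for i in range(600, 610)]

regulon_a = prune_module(module_a, motif_genes, 1000)  # six motif-supported genes survive
regulon_b = prune_module(module_b, motif_genes, 1000)  # no overlap, so the module is dropped
```

The point of the sketch is the two-stage filter: a module must first pass an enrichment test as a whole, and only then are its unsupported members removed.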
Third, as part of SCENIC, we developed the AUCell algorithm to score the activity of each regulon in each cell. (I covered AUCell in an earlier post, "A deeper look at what the R package AUCell does for single-cell analysis".) For a given regulon, comparing AUCell scores across cells makes it possible to identify which cells have significantly higher subnetwork activity. The resulting AUCell score matrix (regulons by cells) can then be used for downstream analysis.
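The idea behind the third step can be sketched in a few lines of Python. This is a simplified, AUCell-style score and not the AUCell implementation: it ranks the genes of one cell by expression, walks down the top of the ranking, and measures how early the regulon genes are recovered, normalized by the best possible case. The function name, the deterministic tie-breaking by gene name, and the toy cells are my own assumptions.

```python
import math


def aucell_like_score(cell_expr, regulon, top_fraction=0.05):
    """Area under the recovery curve of `regulon` within the top-ranked genes of one cell.

    cell_expr: dict mapping gene name -> expression value in this cell
    Returns a value in [0, 1]; 1 means the regulon genes sit at the very top.
    """
    # Rank genes from highest to lowest expression (ties broken by name here).
    ranked = sorted(cell_expr, key=lambda g: (-cell_expr[g], g))
    max_rank = max(1, math.ceil(top_fraction * len(ranked)))
    targets = set(regulon) & set(cell_expr)
    if not targets:
        return 0.0
    recovered = 0
    area = 0
    for gene in ranked[:max_rank]:
        if gene in targets:
            recovered += 1
        area += recovered  # height of the recovery curve at this rank
    # Best case: all regulon genes occupy the first ranks.
    best = sum(min(i, len(targets)) for i in range(1, max_rank + 1))
    return area / best


genes = [f"g{i}" for i in range(20)]
regulon = ["g0", "g1", "g2"]
cell_a = {g: 20 - i for i, g in enumerate(genes)}  # regulon genes are the top three
cell_b = {g: i for i, g in enumerate(genes)}       # regulon genes are at the bottom
cell_c = dict(cell_a, g0=0, g2=0)                  # only g1 remains highly expressed

# top_fraction is raised to 0.25 only because the toy cell has just 20 genes.
scores = {
    name: aucell_like_score(cell, regulon, top_fraction=0.25)
    for name, cell in [("cell_a", cell_a), ("cell_b", cell_b), ("cell_c", cell_c)]
}
```

Because the score depends only on the within-cell ranking, it does not rely on the absolute expression units, which is why the score matrix can be compared across cells.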
Next come a few case studies, which we only need to skim.
First, the mouse brain:
This analysis provided 151 regulons, out of 1,046 initial coexpression modules, with significantly enriched motifs for the corresponding TFs (7% of the initial TFs). Scoring regulon activity for each cell revealed the expected cell types alongside a list of potential master regulators for each cell type.
The paper includes other case studies as well, which I won't go through here; overall, the method works reasonably well.
3. Methods: points worth noting
On GENIE3: it trains random forest models predicting the expression of each gene in the data set and uses as input the expression of the TFs. (So one model is trained per target gene, and its predictors are only the TF rows of the expression matrix rather than the whole matrix.) The different models are then used to derive weights for the TFs, measuring their respective relevance for the prediction of the expression of each target gene. (So this step really is based on coexpression alone.) The highest weights can be translated into TF-target regulatory links. Since GENIE3 uses random-forest regression, it has the added value of allowing complex coexpression relationships between a TF and its candidate targets. (The paper presents this as a strength; personally I count it as a drawback.)
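The "one model per target gene, TFs as predictors, importances as weights" idea can be sketched with the Python standard library alone. This is a deliberately reduced stand-in and not GENIE3 itself: as I understand it, GENIE3 grows full regression trees, whereas the sketch below uses depth-one trees (stumps) on bootstrap samples with a random subset of TFs per tree. All names (`best_stump_gain`, `tf_weights`, `TF_A` and so on) and the simulated data are assumptions for illustration.

```python
import random


def best_stump_gain(xs, ys):
    """Largest drop in squared error from splitting ys at a single threshold on xs."""
    n = len(ys)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(ys)
    total_sq = sum(y * y for y in ys)
    parent_sse = total_sq - total * total / n
    best = 0.0
    left_sum = left_sq = 0.0
    for pos in range(n - 1):
        i = order[pos]
        left_sum += ys[i]
        left_sq += ys[i] ** 2
        if xs[i] == xs[order[pos + 1]]:
            continue  # equal values cannot be separated by a threshold
        n_left, n_right = pos + 1, n - pos - 1
        right_sum, right_sq = total - left_sum, total_sq - left_sq
        sse = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
        best = max(best, parent_sse - sse)
    return best


def tf_weights(tf_expr, target_expr, n_trees=200, n_try=2, seed=0):
    """Weight each TF by the error reduction it achieves when predicting one target gene.

    tf_expr: dict TF name -> list of expression values across cells
    target_expr: list of the target gene's expression across the same cells
    """
    rng = random.Random(seed)
    tfs = sorted(tf_expr)
    n_cells = len(target_expr)
    weights = dict.fromkeys(tfs, 0.0)
    for _ in range(n_trees):
        rows = [rng.randrange(n_cells) for _ in range(n_cells)]  # bootstrap the cells
        ys = [target_expr[r] for r in rows]
        candidates = rng.sample(tfs, min(n_try, len(tfs)))       # random TF subset
        gains = {tf: best_stump_gain([tf_expr[tf][r] for r in rows], ys)
                 for tf in candidates}
        winner = max(gains, key=gains.get)
        weights[winner] += gains[winner]
    total = sum(weights.values())
    return {tf: (w / total if total else 0.0) for tf, w in weights.items()}


# Simulated data: the target follows TF_A; TF_B and TF_C are unrelated noise.
data_rng = random.Random(1)
n_cells = 80
tf_expr = {name: [data_rng.random() for _ in range(n_cells)]
           for name in ("TF_A", "TF_B", "TF_C")}
target = [3.0 * a + 0.1 * data_rng.random() for a in tf_expr["TF_A"]]

weights = tf_weights(tf_expr, target)
top_tf = max(weights, key=weights.get)
```

In the full workflow this would be repeated for every gene in the data set, and the highest-weighted TF-target pairs would form the coexpression modules that go into the motif analysis. Note that nothing here uses sequence information, which is exactly why the modules can contain indirect targets.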
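Note, however, that the input format and normalization used here do not appear to have been designed specifically for 10X single-cell data, which is something to watch.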
As for code, there is an R implementation: SCENIC (R).
Personally, I recommend the Python implementation: pySCENIC.
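My own take
The TFs are known, the motifs they bind are known, and the target genes associated with those motifs are known too. So why not reason forward, from highly expressed TFs to their target genes, instead of reasoning in reverse? If anyone can answer this, I'd be glad to hear it.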
Life is good, and better with you.