SCENIC: Single-cell regulatory network inference and clustering
Abstract
SCENIC can be use to demonstrate genomic regulatory code which can be exploited to guide the identification of transcription factors and cell states.
Introduction
The transcriptional state of a cell emerges from an underlying gene regulatory network (GRN) in which a limited number of transcription factors and co-factors regulate each other and their downstream target genes.
One challenge of single cell transcriptome profile analysis is that at the single cell level, gene expression may be partially disconnected from the dynamics of transcription factor inputs due to stochastic variation of gene expression consecutive to, for example, transcriptional bursting
(基因的表达式动态的,不稳定的)
Transcriptional bursting, also known as transcriptional pulsing, is a fundamental property of genes in which transcription from DNA to RNA can occur in "bursts" or "pulses", which has been observed in diverse organisms, from bacteria to mammals
(from wikipedia)
SCENIC link genomic regulatory code to single-cell gene expression variation to overcome drop-outs and technical variation, and could optimize the discovery and characterization of cellular states
The SCENIC workflow consists of three steps
Use GENIE3 to identify sets of genes that are co-expressed with transcription factors
(GRNboost was also provided to build the co-expression network on bigger datasets)
In brief, it trains random forest models predicting the expression of each gene in the dataset, using as input the expression of the transcription factors.The different models are then used to derive weights for the transcription factors, measuring their respective relevance for the prediction of the expression of each target gene. The highest weights can be translated into TF-target regulatory links.
Input
The preferred expression values are gene-summarized counts (which might or might not use unique molecular identifiers, UMI). Other measurements, such as counts or transcripts per million (TPM) and FPKM/RPKM, are also accepted as input. However, note that some authors recommend avoiding within sample normalization (i.e. TPM) for co-expression analysis because they may induce artificial co-variation
output
The output of GENIE3 is a table with the genes, the potential regulators, and their
“importance measure” (IM), which represents the weight that the transcription factor (input gene) has in the prediction of the target.
Each gene-set was then split into positive- and negative- correlated targets (i.e. Spearman correlation between the TF and the potential target) to separate likely activated and repressed targets. Finally, only the gene-sets (TF co-expression modules) with at least 20 genes were kept for the following step
Being based only on co-expression, these modules may include many false positives and indirect targets. So RcisTarge was used.
Use RcisTarge to identify putative direct-binding targets from co-expression module
Only modules with significant motif enrichment of the correct upstream regulator are retained, and pruned to remove indirect target genes without motif support.
Reson:
Step1
It selects DNA motifs that are significantly over-represented in the surroundings of the transcription start site (TSS) of the genes in the gene-set.The motifs that are annotated to the corresponding TF and obtain a Normalized Enrichment Score (NES) > 3.0 are retained. For each species, two gene-motif rankings (10kb around the TSS or 500bp upstream the TSS), which determine the search space around the transcription start site were used.
Step2
For each motif and gene-set, RcisTarget predicts candidate target genes (i.e. genes in the gene-set that are ranked above the leading edge).
However, in the datasets we analyzed, these modules were less numerous and showed very low motif enrichment, suggesting that these are lower quality modules. For this reason, we finally decided to exclude the detection of direct repression from the workflow, and continue only with the positive-correlated targets.
Score the activity of each of these regulons in each cell with AUCell
The relative scores of each regulon across the cells allow identifying which cells have a significantly high sub-network activity. The resulting binary activity matrix can be used as a biological dimensionality reduction for downstream analyses. Since the regulon is scored as a whole, instead of only the TF or individual genes, this approach is robust against drop-outs.
The input to AUCell is a gene set, and the output the gene set “activity” (AUC) in each cell.
AUCell calculates the enrichment of the regulon as an area under the recovery curve (AUC) across the ranking of all genes in a particular cell, whereby genes are ranked by their expression value. This method is therefore independent of the gene expression units and the normalization procedure
Cell clustering based on GRNs
2 matrix
The cell-regulon activity is summarized in a matrix in which the columns represent the cells and the rows the regulons. In the binary regulon activity matrix, the coordinates of the matrix that correspond to active regulons in a given cell will contain a “1”, and “0” otherwise.
The equivalent matrix, containing the continuous AUC values for each cell- regulon, is normally referred to as the AUC activity matrix.
The binary activity matrix tends to highlight higher-order similarities across cells (and therefore, highly reduces batch effects and technical biases), on the other hand, the AUC matrix allows to observe more subtle changes.
Gene filtering
first filter
The total number of reads per gene, is meant to remove genes that are most likely unreliable and provide only noise. The specific value depends on the dataset.
second filter
The number of cells in which the gene is detected (e.g. >0 UMI, or >1 log2(TPM)), is to avoid that genes that are only expressed in one, or very few cells, gain a lot of weight if they happen to coincide in a given cell. In the workflow, we recommend to set a percentage lower than the smallest population of cells to be detected