hello,大家好,从今天开始,我们开始走上正轨,分享10X单细胞或者10X空间转录组的分析内容,今天我们分享的内容就是做细胞定义的分析策略,文章在Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods,2021年6月发表于nature protocols,影响因子10分,还是我们一贯的思路,先分享文献内容,最后看看示例代码。
Abstract(我们总结一下)
We recommend a three-step workflow including automatic cell annotation (wherever possible), manual cell annotation and verification.(自动细胞注释(应该是软件),人工细胞注释和确认)。
Frequently encountered challenges are discussed, as well as strategies to address them.(希望对我们有所帮助)。
Guiding principles and specific recommendations for software tools and resources that can be used for each step are covered, and an R notebook is included to help run the recommended workflow.(看来作者已经写好了示例,当然,需要我们有一定的基础)。
Introduction
(1)细胞注释,我们单细胞分析得到的注释类似于下图,注释的结果必须是可解释的并且支持生物学的发现。
To interpret this map biologically, it is necessary to determine which cell types or cell states are represented by clusters or other patterns observed in the data。These interpretations can then be labeled on the map, which helps place them in a conceptual framework useful for better understanding tissue biology.
简单回顾一下三种降维的优缺点
-
t-SNE:preserves local groups of similar cells,while equalizing the density of cells within each group(同时均衡每组内的细胞密度);global relationships between cell types are not maintained, and thus
cluster-to-cluster relationships cannot be inferred and may be misleading。 - UMAP:UMAP is typically regarded as better for visualizing global relationships and gradients than t-SNE(这个大家应该熟悉)。
- PCA:PCA can be useful for visualizing cell gradients and states,线性降维
(2)细胞注释的一般步骤automatic annotation, manual annotation and verification,这一部分大家应该都很熟悉了。
First, automatic annotation uses a predefined set of ‘marker genes’ or reference single-cell data to identify and label individual cells or cell clusters by matching their gene expression patterns (signatures) to those of known cell types.(这个应该是所有做细胞定义的分析过程,当然,也很困难)。
A second major step is manual annotation, which involves studying genes and gene functions specific to each cell cluster or pattern to verify automatic cell annotations and identify novel cell types and states.(尤其存在新的细胞类型,更加难以确认)。
Finally, verification can confirm the identity and function of select cell types using independent methods, such as new validation experiments.接下来,喔们逐步分析看看。
Step 1: automatic cell annotation 两种方式,1、marker gene 2、参考集,各有利弊
A major challenge with automatic cell annotation is that many cell types do not have well-characterized gene expression signatures, resulting in incomplete or inaccurate labeling for some cells.Automated methods typically work better for major cell types and may not be able to effectively distinguish subtypes.(细胞数多的可靠性高)。自动细胞注释有助于快速识别已知细胞类型并突出显示未知细胞类型以供进一步探索。
Comparison of the caveats and recommendations for different approaches to cell annotation
Stage of analysis | Aspect of analysis | Potential caveats | Recommendation |
---|---|---|---|
Automatic cell annotation | All automatic methods | Fast, but not effective for poorly characterized cells | Use manual annotation for poorly characterized cells |
Annotating clusters | May miss important differences between cells | Use automatic annotation of clusters to get a general idea of cell type and then refine labels manually. In addition, use multiple cluster-based methods and compare results | |
Annotating individual cells | Ideal, but requires high reads per cell | Experiments with low reads per cell require cluster-based annotation | |
Marker-based annotation methods | Marker genes not easily accessible for all cell types; may result in conflicting or absent cell labels | Requires expert knowledge to curate more extensive marker lists | |
Reference-based annotation methods | Perform poorly with incomplete or poorly matched reference data, which may result in conflicting or absent cell labels | Use well-matched reference data or marker-based methods if such data are unavailable | |
Often requires batch correction, which may reduce the accuracy of results | Analyze the reference data for strong biological signals. Use a good experimental protocol that will prevail over batch effects | ||
Mistakes in reference data get carried over to results | Analyze reference data for potential errors before using | ||
Comparing results from different automatic annotation methods | Results may not agree with each other | Compare confidence scores of respective labels and consider label agreement (majority rule); resolve conflicts using manual annotation | |
Consider the possibility of cell subtypes, new cell types or gradients and cell states | |||
Expert manual cell annotation | All manual methods | Slow, labor-intensive | Whenever possible, begin with automatic annotation to determine general cell labels |
Subjective | Work with an expert; consider multiple cell-type conclusions | ||
Marker-based annotation | Cell types not distinguishable by a single marker | Use multiple markers for each cell type | |
Known markers not distinguishing cell types | Curate larger lists of markers from the literature, additional experiments or experts | ||
Conflicting marker gene sets between sources | Select a marker gene set that best represents the biological signal being looked for in the data (e.g., if looking for cell subtypes, use more extensive gene sets than what is used for general cell-type annotation) |
先来看看第一个,基于marker gene 的注释
To be successful, the marker gene or gene set (a collection of marker genes) should be specifically and consistently expressed in a given cell, cluster or class of cells。大而全的marker gene列表是必需的。
(1)To label individual cells(注意这里是定义单个细胞),one of the most reliable markerbased annotation tools is semi-supervised category identification and assignment(半监督的策略,SCINA,这里需要我们认为提供marker gene列表,原理大家需要看看原文章了)。
(2)AUCell is another good marker-based labeling method that classifies individual cells or clusters.(关于AUcell大家可以参考文章深入理解R包AUcell对于分析单细胞的作用)。AUCell ranks the genes in each cell by decreasing expression value, and cells are labeled according to their most active (highly expressed) marker gene sets.(原理,了解一下)
(3)gene set variation analysis (GSVA) has been benchmarked to be fast and reliable。GSVA works similarly to AUCell: given a database of marker gene sets, it identifies sets that are enriched in the gene expression profile of a cluster.(关于GSVA做细胞定义,这个是目前大多数公司的做法)。
marker 定义细胞类型的最大问题,A disadvantage of these tools is that markers are not easily accessible for all cell types.
软件注释细胞类型的表格
Tool | Type | Language | Resolution | Approach | Allows ‘None’ | Notes |
---|---|---|---|---|---|---|
singleCell Net | Reference based | R | Single cells | Relative-expression gene pairs + random forest | Yes, but rarely does so even when it should | 10–100× slower than other methods; high accuracy |
scmap-cluster | Reference based | R | Single cells | Consistent correlations | Yes | Fastest method available; balances falsepositives and false-negatives; includes web interface for use with a large pre-built reference or custom reference set |
scmap-cell | Reference based | R | Single cells | Approximate nearest neighbors | Yes | Assigns individual cells to nearest neighbor cells in reference; allows mapping of cell trajectories; fast and scalable |
singleR | Reference based | R | Single cells | Hierarchical clustering and Spearman correlations | No | Includes a large marker reference; does not scale to data sets of ≥10,000 cells; includes web interface with marker database |
Scikit-learn | Reference based | Python | Multiple possible | k-nearest neighbors, support vector machine, random forest, nearest mean classifier and linear discriminant analysis | (Optional) | Expertise required for correct design and appropriate training of classifier while avoiding overtraining |
AUCell | Marker based | R | Single cells | Area under the curve to estimate marker gene set enrichment | Yes | Because of low detection rates at the level of single cells, it requires many markers for every cell type |
SCINA | Marker based | R | Single cells | Expectation maximization, Gaussian mixture model | (Optional) | Simultaneously clusters and annotates cells; robust to the inclusion of incorrect marker genes |
GSEA/GSVA | Marker based | R/Java | Clusters of cells | Enrichment test | Yes | Marker gene lists must be reformatted in GMT format. Markers must all be differentially expressed in the same direction in the cluster |
Harmony | Integration | R | Single cells | Iterative clustering and adjustment | Yes | Integrates only lower-dimensional projection of the data; seamlessly(无缝地) integrated into Seurat pipeline; may overcorrect data |
Seurat-canonical correlation analysis | Integration | R | Single cells | MNN anchors + canonical correlation analysis | Yes | Accuracy depends on the accuracy of MNN anchors, which are automatically-identified corresponding cells across data sets |
mnnCorrect | Integration | R | Single cells | MNN pairs + singular value decomposition | Yes | Accuracy depends on the accuracy of MNN pairs (cells matched between data sets). |
Linked inference of genomic experimental relationships (LIGER) | Integration | R | Single cells | Non-negative matrix factorization | Yes | Allows interpretation of data set–specific and shared factors of variation |
接下来看看第二部分 Reference-based automatic cell annotation
this approach is possible only if high-quality and relevant annotated reference single-cell data are available。目前单细胞已经有了一些公共的数据库可以获取参考集。These atlases typically contain hundreds of thousands of cells and dozens of different annotated cell types。
这种方法有一个共同的特点,需要一个注释完整的参考集,一旦参考集是不完整的、缺失的,准确度就会明显的下降。
当然,原则上任何做整合分析的方法都可以用于细胞定义,简单回顾一下常见整合方法的特点。
- Harmony iteratively merges data sets represented by top PCs, which are then used to cluster cells. Each cell is iteratively adjusted on the basis of an estimated correction vector to shift it closer to the center of its cluster until convergence. MNN approaches, such as mnnCorrect/FastMNN or Seurat v321, identify the most similar cells (MNNs), called ‘anchors’, across data sets that are used to estimate and correct the cell type–specific batch effects. LIGER identifies shared (common biology) and unique (biological or technical) factors between data sets using non-negative matrix factorization. LIGER is recommended when specific cell types appear to be present in some of the data sets and missing in others. Integration methods can suffer from overcorrection, where different cell types are merged, or undercorrection, when resulting clusters contain cells from only one input data set. Multiple integration methods may need to be evaluated to find a balance that best represents the data.(批次矫正的方法也要根据情况来判断),过度矫正是目前常见的问题
细化自动注释
Benchmarking studies show variable performance of automatic annotation tools, depending on the data set and distinctiveness of the gene expression profiles of the cell types to be annotated(软件之间的也不具有统一性)。
For instance, distinguishing T cells from B cells is relatively straightforward, but automatic tools sometimes cannot accurately distinguish CD8+ cytotoxic T cells from natural killer cells(如下图)
When applying multiple cell annotation methods to a data set, cells or clusters will acquire multiple, sometimes conflicting, cell-type labels.(这也是最大的问题)。如果定义的一致,那么很容易辨别,容易定义结果存在矛盾,那么每个软件提供一个可能性分数to identify a single high-scoring label.但是不同软件的判定分数不具有比较性,但是可以根据定义结果出现的频率来加以识别,出现的频率越高,越可能是某种细胞类型,如果这些都不行,那就只有人工注释了。
如果注释存在矛盾,很可能说明该cluster还有subtype,但是,如果不能明确定义亚型,则更通用的细胞类型注释可能更合适。 这里举一个例子,For example, if a cluster is annotated as regulatory T cells, naive T cells and helper T cells by different methods, it may be most appropriate to assign the general label of ‘T cells’.
If the conflicting annotations are not subtypes of the same cell type, then the cluster may represent an intermediate cell state or gene-expression gradient。(中间状态或者基因表达等级,这个地方很值得挖掘)。
Finally, a cluster may have a novel cell identity that is absent from the reference data. This often results in widely varying results from automatic annotation methods or insufficient confidence for any tool to assign any label. In such situations, manual annotation must be performed.(新的细胞类型就需要我们人工注释了)。
Step 2: expert manual cell annotation(人工注释)
人工注释细胞类型目前是最可靠的方法(gold-standard method),但是,it is slow and labor intensive and can be subjective.主要的人工注释,就是我们人工选择marker的过程,很繁琐。
所有挑选出来的基因必须进行检验和可视化(比如dotplot和热图)
Challenges in this approach are that well-known markers are often too few in number to completely annotate an scRNA-seq data set, and some well-known markers may not be as specific within an scRNA-seq data set as expected.
set. Master transcription factors that drive cell fate often make better gene expression markers than cell-surface proteins that are commonly used to classify cell populations(转录因子基因的识别能力更好),因为转录组水平和蛋白水平并不不能很好的关联。
识别同一个细胞类型的marker gene通常是多个,尤其在定义subcluster的时候。
The ideal primary source for cell-defining genes is a singlecell atlas from a relevant organism, organ and disease context.(marker基因具有组织、器官、疾病特异性)。
in some instances a cluster may not express markers of any known cell type; conversely, it may express markers of more than one cell type.(这种情况就是低质量的细胞、新的细胞类型或者含有subcluster)。
Once cell-type information from known markers is exhausted, cells that have not been confidently annotated must be manually examined, cluster by cluster.(marker gene无法起到作用的时候,就需要人工进一步检验了。潜在的marker gene需要用一些软件来进行差异分析)。
All marker genes are then manually researched to find functional information that may help identify the cell type of the cluster with which they are associated。(例如通路富集)。
Some cells may be challenging to annotate, including novel cell types, which can be described on the basis of the function of genes they express。
Annotating cell states and gradients(针对新的细胞类型)
When analyzing and characterizing novel cell types, it is important to determine whether they represent a stable cell type or contain multiple cell states.(稳态还是多种细胞状态)。
细胞类型和状态的定义尚未标准化,但可能预期稳定的细胞类型在整个cluster中具有homo基因表达并且聚类在一起。
whereas cell gradients appear as a spread-out string of cells and cell states,Expression gradients indicate continuous differences that are present in the cell population, which could represent states like the cell cycle, immune activation, spatial patterning or transient developmental stages。识别有意义的细胞状态去除批次效应意义重大。
注释细胞发育的中间阶段通常很困难,因为这些区域很少表达独特的标记基因。 It is often easier to label the ends of a gradient and then characterize intermediate stages using the order of specific genes that mark these ends as increasing or decreasing across the gradient。
Extracting the cells in the gradient and performing principal component analysis (PCA) on them is often a useful visualization for gradients, because it preserves the large-scale distances between cells(有的轨迹分析软件就采用这样的策略)。
目前没有可以自动注释中间态的方法,不同细胞的层次只能人工识别,making use of known structure and celltype transitions relevant to the particular experiment。
Similarly, homogeneous or similar cell states or cell types are often difficult to annotate because they share many of the same marker genes。(这个时候就需要再分群分析了).
Very fine distinctions between highly similar cell types may not be visible transcriptionally and may be visible only in other genomic layers, such as chromatin state (assay for transposase-accessible chromatin using sequencing (ATAC-seq) and DNA methylation).(多组学识别细胞类型也是很重要的一点)。
Step 3: annotation verification
需要其他的分析辅助验证了,包括多组学,SC-ATAC,CNV等等。
最后,附上一张做细胞定义的软件总结
我们来看一下示例代码
1. Reference-based automatic annotation
Create the Reference
The first step in performing reference-based annotation is to select an annotated dataset to use as the reference. Here we will use one of the references created by the authors of SingleR and show how it can be used with other tools such as scmap.
Other reference datasets can be found in GEO (https://www.ncbi.nlm.nih.gov/geo/) or at a link provided by the authors of the reference dataset. However, to use a dataset as a reference you will need both the single-cell RNA sequencing data and the cell-type annotations. GEO does not require authors to provide the cell-type annotations of their data, so you may need to contact the authors directly to to get the annotations for some datasets.
# Set a random seed to ensure result reproducibility
set.seed(9742)
# Download singleR reference data for immune cells and save it as the variable "ref"
# The variable is a class called "Summarized Experiment"
# This will take a while
ref <- celldex::DatabaseImmuneCellExpressionData()
Next we need to reformat the data to ensure it is compatible with the tool we are using. We will be demonstrating scmap, which uses data formatted as a ‘SingleCellExperiment object’, and assumes by default that gene names are found in a column named ‘feature_symbol’ while the cell-type labels are in a column named ‘cell_type1’. In addition, scmap requires that you normalize and log-transform the reference data; this has already been done for the SingleR reference data so we skip those steps here.
# Assign cell-type labels in a column named "cell_type1"
colData(ref)$cell_type1 <- colData(ref)$label.fine
# Assign gene names in a column called "feature_symbol"
rowData(ref)$feature_symbol <- rownames(ref)
# Convert the data into a SingleCellExperiment object
ref_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(logcounts=Matrix::Matrix(assays(ref)$logcounts)),
colData=colData(ref), rowData=rowData(ref))
Our reference data is ready to be used now. So lets process this data to build the index we will use to map our unlabeled data to. First, we select genes to use, which will be those deemed most informative by scmap after fitting a linear model to the gene expression by gene dropout distribution. Those which are most informative have high expression values and low % dropout rates across cells.
# Create scmap-cluster reference by first selecting the most informative features
ref_sce <- scmap::selectFeatures(ref_sce, suppress_plot=FALSE)
Your object does not contain counts() slot. Dropouts were calculated using logcounts() slot...
# Inspect the first 50 genes selected by scmap
rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)][1:50]
# You can check and see how many genes were chosen by checking the length of the
# vector of gene names
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 500
Now we can see the genes that scmap has chosen to use. If there are key marker genes missing we can make sure they are included like this:
# Create a list of key markers that you want to use
my_key_markers = c("TRAC", "TRBC1", "TRBC2", "TRDC", "TRGC1", "TRGC2", "IGKC")
# Ensure markers are in the list of features used by scmap
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% my_key_markers] <- TRUE
# You can check and see if this added any genes by checking the length
# of the vector of gene names again
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 502
And we can remove genes that we think might be technical artefacts, such as mitochondria RNAs, like this:
# Create a list of mitochondrial genes from the dataset (genes that begin with "MT")
mt_genes <- rownames(ref_sce)[grep("^MT-", rownames(ref_sce))]
# Remove these genes from the features used by scmap
rowData(ref_sce)$scmap_features[rownames(ref_sce) %in% mt_genes] <- FALSE
# Check how many genes this is
length(rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)])
[1] 495
# Extract the features and assign them to a new variable, "scmap_feature_genes"
scmap_feature_genes <- rownames(ref_sce)[which(rowData(ref_sce)$scmap_features)]
# Note that the number of genes/features is identical to what we just checked
length(scmap_feature_genes)
[1] 495
Now we build the reference profiles used in scmap-cluster, for cluster-based cell-type annotation. These profiles can be accessed and plotted from inside the SingleCellExperiment object as follows:
# Create reference profiles;
# Once reference profiles are generated the original data are
# not needed for scmap-cluster
ref_sce <- scmap::indexCluster(ref_sce)
# Visualize interesting features as a heatmap
# Reformat the data so that they can be used as input to ggplot2
cormat <- reshape2::melt(as.matrix(metadata(ref_sce)$scmap_cluster_index))
# Plot the data
ggplot2::ggplot(cormat, ggplot2::aes(x = Var2, y = Var1, fill = value)) +
ggplot2::geom_tile() +
ggplot2::scale_fill_gradient2(low = "blue", high = "darkred",
name = "Expression value") +
ggplot2::theme_minimal() +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
size = 18, hjust = 1),
axis.text.y = ggplot2::element_text(size = 15),
axis.title.x = ggplot2::element_blank(),
axis.title.y = ggplot2::element_blank())
# Store expression information as a variable
scmap_cluster_reference <- metadata(ref_sce)$scmap_cluster_index
From here on out, scmap only needs this set of reference profiles. So if working with a very large reference, one could save this index separately to your computer and reload it when annotating new datasets. But since that is not the case here, we will simply save this index to a variable for now.
We will also demonstrate scmap-cell to annotate individual cells of our dataset, so we will create that index as well. As before one would first normalize and log-transform the reference data, and select genes to use. As we have already done that, we need only run the command to build the scmap-cell index. There are two parameters we can set: M and k, increasing M and k will give more accurate mapping but increase the size of the index, and the time needed to map cells. Here we use the defaults (you may see a warning message about the defaults that are being used):
# Update the previous reference to also contain the scmap-cell reference
ref_sce <- scmap::indexCell(ref_sce)
Parameter M was not provided, will use M = n_features / 10 (if n_features <= 1000), where n_features is the number of selected features, and M = 100 otherwise.
Parameter k was not provided, will use k = sqrt(number_of_cells)
# Extract the scmap index from the reference and store as a variable
scmap_cell_reference <- metadata(ref_sce)$scmap_cell_index
# Extract the associated cell IDs from the reference and save as a variable
scmap_cell_metadata <- colData(ref_sce)
scmap-cell assigns cells in one dataset to their “nearest neighbours” in the reference dataset. In this case, the “nearest neighbours” are the cells in the reference dataset most similar to the cells in the query dataset.
One can use any rule they like to transfer information, such as cell-type or pseudotime, from these nearest neighbours to the query data. Thus we need to store the associated metadata (cell type ID) for the reference as well (see above). Now we don’t need to use our original reference dataset anymore.
Assign cells from the query dataset to the reference.
The query dataset we will be using is provided by 10X genomics.
download.file("https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz",
"pbmc3k_filtered_gene_bc_matrices.tar.gz")
trying URL 'https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz'
Content type 'application/x-tar' length 7621991 bytes (7.3 MB)
==================================================
downloaded 7.3 MB
untar("pbmc3k_filtered_gene_bc_matrices.tar.gz")
Now we need to load our unlabeled dataset into R. Normal preprocessing including QC filtering, normalizing and log-transforming the data must be done prior to annotating. In addition, scmap is based on the SingleCellExperiment object, so if our data is stored as a Seurat object we must convert it to SingleCellExperiment as shown below.
# This portion of the tutorial is assuming the raw 10X data is in the
# following folder in your directory:
data <- Seurat::Read10X("filtered_gene_bc_matrices/hg19/")
# Make SingleCellExperiment from the raw matrix
query_sce <- SingleCellExperiment::SingleCellExperiment(assays=list(counts=data))
# Make SingleCellExperiment from Seurat object
query_seur <- Seurat::CreateSeuratObject(data)
Feature names cannot have underscores ('_'), replacing with dashes ('-')
query_sce <- Seurat::as.SingleCellExperiment(query_seur)
# normalize the data using the scater package
query_sce <- scater::logNormCounts(query_sce)
# add feature_symbol column (i.e. the gene symbols)
rowData(query_sce)$feature_symbol <- rownames(query_sce)
Now you should have an entry in assays(my_sce)
called logcounts
with the log-normalized matrix. We are now ready to annotate our data with scmap-cluster. Let’s start with scmap-cluster:
# Run scmapCluster
scmap_cluster_res <- scmap::scmapCluster(projection=query_sce,
index_list=list(immune1 = scmap_cluster_reference),
threshold=0.1)
# plot the results of our annotation
par(mar=c(13, 4, 1, 0))
barplot(table(scmap_cluster_res$combined_labs), las=2)
# Store this annotation information within the query object
colData(query_sce)$scmap_cluster <- scmap_cluster_res$combined_labs
# Make a UMAP of the cells, labeled with the cell-type annotations from scmapCluster
query_sce <- scater::runUMAP(query_sce)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cluster")
Alternatively we could use scmap-cell, to find the 10 nearest neighbours to each cell (i.e. the 10 most similar cells to each query cell), then pick the annotation that is most common among the neighbours, like this:
# Determine the 10 nearest neighbours from the reference dataset for each
# cell in the query dataset using scmapCell
nearest_neighbours <- scmap::scmapCell(projection=query_sce,
index_list = list(immune1 = scmap_cell_reference),
w=10)
# Get metadata (cell type IDs) for the neighbours of each cell in the query dataset
mode_label <- function(neighbours, metadata=scmap_cell_metadata$cell_type1) {
freq <- table(metadata[neighbours])
label <- names(freq)[which(freq == max(freq))]
if (length(label) > 1) {return("ambiguous")}
return(label)
}
# Apply these labels to the query cells
scmap_cell_labs <- apply(nearest_neighbours$immune1$cells, 2, mode_label)
# Add the labels to the query object
colData(query_sce)$scmap_cell <- scmap_cell_labs
# Create a bar plot of how many cells in the query dataset were assigned
# a specific label
par(mar=c(10, 4, 0, 0))
barplot(table(scmap_cell_labs), las=2)
# Make a UMAP and add the new cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="scmap_cell")
Another option compatible with the SingleCellExperiment Object is SingleR. As before, we need a reference and a query dataset. In the case of SingleR, we need the entirety of the reference dataset, rather than generating a compressed reference index as we did with scmap. In addition, running just this small example demonstrates the difference in run time between the methods (SingleR takes a fair bit of time).
# Run SingleR on the query data and the reference to acquire
# cell-type predictions for the cells in the query dataset
predictions <- SingleR::SingleR(test=query_sce, ref=ref, labels=ref$label.fine)
# You'll notice that some of the cells didn't get assigned a cell identity
# We can count the number here:
sum(is.na(predictions$pruned.labels))
[1] 23
# Change NAs to "ambiguous"
predictions$pruned.labels[which(is.na(predictions$pruned.labels))] <- "ambiguous"
# Add singleR labels to query_sce
colData(query_sce)$singleR <- predictions$pruned.labels
# Create a bar plot of number of cells per assigned cell ID
par(mar=c(13, 4, 2, 0))
barplot(table(predictions$pruned.labels), las=2)
# Make a UMAP and add the cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="singleR")
Integration as a form of annotation
Another option is to integrate our query data with our reference data. Then we simply transfer the labels from the annotated reference to the neighbouring query cells in the integrated dataset. Clustering the integrated data is a common approach to transferring labels. We demonstrate how this could be done with Harmony below. But the approach would be the same for any integration tool.
Note: the SingleR reference is not single cells, but averages across many cells. Thus we convert and downsample the reference to a single cell object for demonstration purposes. For a real experiment, one would use the original single cells as the reference when integrating datasets.
set.seed(2891)
# Convert reference and query datasets to Seurat Objects
# Add a "counts" slot to the reference SingleCellExperiment object so we can convert it to a Seurat Object
assays(ref_sce)[["counts"]] <- round(2^assays(ref_sce)[["logcounts"]]) -1
colnames(ref_sce) <- paste("cell", 1:ncol(ref_sce))
# Subset both objects so both the reference and query datasets have the same genes
# First subset the reference
ref_seur <- Seurat::as.Seurat(ref_sce[rownames(ref_sce) %in% rownames(query_sce),])
ref_seur@active.ident <- factor(rep("reference", ncol(ref_seur)))
# Now subset the query
query_seur <- Seurat::as.Seurat(query_sce[rownames(query_seur) %in% rownames(ref_sce),])
query_seur@active.ident <- factor(rep("query", ncol(query_seur)))
# Downsample the reference to be similar to query in terms of total UMIs
totalUMI <- median(query_seur@meta.data$nCount_RNA)
ref_seur@assays$RNA@counts <- Seurat::SampleUMI(ref_seur@assays$RNA@counts,
max.umi=totalUMI, upsample=FALSE)
# Merge the datasets together into a single Seurat object
merged_seur <- merge(ref_seur, query_seur)
merged_seur@meta.data$source <- merged_seur@active.ident
# Normalize the combined data
merged_seur <- Seurat::NormalizeData(merged_seur)
# Rather than choosing new variable features, we will choose
# the genes that had been previously important by scmap for consistency
Seurat::VariableFeatures(merged_seur) <- scmap_feature_genes
# Scale the data and run dimensionality reduction on the combined data
merged_seur <- Seurat::ScaleData(merged_seur)
merged_seur <- Seurat::RunPCA(merged_seur)
merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15)
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("Before Integration")
# Run Harmony to remove batch effects
merged_seur <- harmony::RunHarmony(merged_seur, "source", dims.use=1:15)
merged_seur <- Seurat::RunUMAP(merged_seur, dims=1:15, reduction="harmony")
# Plot the data
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")
Now that the data is integrated we will cluster the data and look at the annotations of the reference cells present in each cluster. As with all clustering, this may require manual tuning of the resolution parameters to get the best labels.
# Cluster the integrated dataset
merged_seur <- Seurat::FindNeighbors(merged_seur, reduction="harmony", dims=1:15)
merged_seur <- Seurat::FindClusters(merged_seur, resolution=0.5)
# Plot the data
Seurat::DimPlot(merged_seur, reduction="umap") + ggplot2::ggtitle("After Integration")
# Create a table of cluster labels based on integrated data
table(merged_seur@meta.data$label.fine,
merged_seur@active.ident)
0 1 2 3 4 5 6 7 8 9
B cells, naive 0 0 0 106 0 0 0 0 0 0
Monocytes, CD14+ 0 0 106 0 0 0 0 0 0 0
Monocytes, CD16+ 0 0 0 0 0 0 105 0 0 0
NK cells 0 105 0 0 0 0 0 0 0 0
T cells, CD4+, memory TREG 0 0 0 0 0 0 0 104 0 0
T cells, CD4+, naive 4 0 0 0 95 4 0 0 0 0
T cells, CD4+, naive TREG 1 0 0 0 102 0 0 1 0 0
T cells, CD4+, naive, stimulated 0 0 0 0 0 0 0 0 102 0
T cells, CD4+, TFH 96 0 0 0 5 0 0 3 0 0
T cells, CD4+, Th1 103 0 0 0 1 0 0 0 0 0
T cells, CD4+, Th1_17 104 0 0 0 0 0 0 0 0 0
T cells, CD4+, Th17 99 0 0 0 0 0 0 5 0 0
T cells, CD4+, Th2 96 0 0 0 6 0 0 2 0 0
T cells, CD8+, naive 0 0 0 0 1 103 0 0 0 0
T cells, CD8+, naive, stimulated 0 0 0 0 0 1 0 0 0 101
Here we have a table of the reference annotations (across rows) per cluster (across columns). We can manually label the clusters based on this table or we could create a rule to algorithmically label the clusters based on this table. Since there are only 11 clusters, we assign the labels manually.
cluster_labs <- c("0"="ambiguous",
"1"="Monocytes, CD14+",
"2"="B cells, naive",
"3"="T cells, CD4+, naive TREG",
"4"="T cells, CD4+, Th1_17",
"5"="NK cells",
"6"="T cells, CD8+, naive",
"7"="Monocytes, CD16+",
"8"="T cells, CD4+, memory TREG",
"9"="T cells, CD4+, naive, stimulated",
"10" = "T cells, CD8+, naive, stimulated")
# Assign cluster label to the associated query cells
# (the query cells that had been assigned the same cluster label)
merged_seur@meta.data$annotation <- cluster_labs[merged_seur@meta.data$RNA_snn_res.0.5]
# Add the results to the SingleCellExperiment Object and plot
query_sce$Harmony_lab <- merged_seur@meta.data$annotation[merged_seur@meta.data$source =="query"]
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="Harmony_lab")
2. Refining / Consensus annotations
Once we have run several tools, we can use the consensus of the labels to get a more robust annotation. In this case we will simply use the most common label across tools to assign the final automatically annotated label.
Hide
annotation_columns <- c("scmap_cluster", "scmap_cell", "singleR", "Harmony_lab")
#Optional check how consistent the labelling was.
#head(colData(query_sce)[,annotation_columns])
get_consensus_label <- function(labels){
labels <- labels[labels != "ambiguous"]
if (length(labels) == 0) {return("ambiguous")}
freq <- table(labels)
label <- names(freq)[which(freq == max(freq))]
if (length(label) > 1) {return("ambiguous")}
return(label)
}
colData(query_sce)$consensus_lab <- apply(colData(query_sce)[,annotation_columns], 1, get_consensus_label)
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="consensus_lab")
3. Marker-based automatic annotation
An alternative way for annotation of your query scRNAseq dataset is to utilize Marker-based annotation tools. SCINA is a semi-supervised annotation tool that takes in the signature genes and expression matrix and predicts the potential labels based on the prior knowledge of the cell-type-specific markers. List of markers is usually provided in the gmt format. The PBMC gene set used below have been gathered by Diaz-Mejia JJ et al.
download.file("https://zenodo.org/record/3369934/files/pbmc_22_10x.tar.bz2",
"pbmc_22_10x.tar.bz2")
untar("pbmc_22_10x.tar.bz2")
The extracted data will by located in the following file: ./MY_PAPER/SUPPLEMENTARY_DATA/pbmc_22_10x/pbmc_22_10x_cell_type_signature_gene_sets.gmt
The results from this annotation tool are not used in the above step to find consensus annotations because the lists of marker genes are not consistent with the cell types identified in the reference dataset. This is because these data come from different sources, and would not have been characterizing the exact same set of cells. If you wish for marker-based and reference-based annotation methods to be combined in the above step of automatically determining consensus annotations, you would have to make sure all of the identified cell subtypes are the same and that they are spelt the exact same way in order for R to recognize the names as identical.
# Import the marker genes as a GMT file and store as a variable
markers <- msigdb::read.gmt('./MY_PAPER/SUPPLEMENTARY_DATA/pbmc_22_10x/pbmc_22_10x_cell_type_signature_gene_sets.gmt')
# Convert the expression data from Seurat object into a matrix data structure
exprMatrix <- as.matrix(Seurat::GetAssayData(query_seur))
# Run SCINA on the query data using the marker genes to identify cell types
# Specifying rm_overlap = FALSE allows the same marker gene to specify multiple cell types which
# may be useful if identifying cell subtypes or other similar types of cells
# Specifying allow_unknown = TRUE allows cells to be labeled as "unknown" instead of being
# assigned a low-confident label
predictions.scina = SCINA::SCINA(exp = exprMatrix, signatures = markers$genesets,
rm_overlap = FALSE, allow_unknown = TRUE)
# Add SCINA annotation information to each cell in Seurat object
colData(query_sce)$SCINA <- predictions.scina$cell_labels
# Make a UMAP and add the SCINA cell-type annotations
scater::plotReducedDim(query_sce, dimred="UMAP", colour_by="SCINA") +
ggplot2::theme(legend.position = "bottom",
legend.text = ggplot2::element_text(size = 4))
4. Manual annotation
Retrieving marker genes
If you do not have an extensive list of markers per cell type, or a good quality reference dataset, it is useful to extract the top marker genes from each cluster of your query data. We can easily do this in Seurat, with the data formatted as a *Seurat object (which we created earlier and stored as the variable query_seur
). First, the data must be normalized and scaled, and the variable genes between cells must be determined.
Hide
query_seur <- Seurat::NormalizeData(query_seur) # Normalize the data
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
query_seur <- Seurat::FindVariableFeatures(query_seur) # Determine the variable features of the dataset
Calculating gene variances
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating feature variances of standardized and clipped values
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
query_seur <- Seurat::ScaleData(query_seur) # Scale the data based on the variable features
Next, different types of dimensionality reduction must be performed on the data so that the cells can be grouped together in 2D space.
query_seur <- Seurat::RunPCA(query_seur)
query_seur <- Seurat::RunTSNE(query_seur)
# RunUMAP has already been performed on the data, so the following line of code
# does not need to be run in this case:
#query_seur <- Seurat::RunUMAP(query_seur, dims = 1:50)
From this object, we can cluster the data at a chosen resolution that can be modified later on if desired.
# Determine the "nearest neighbours" of each cell
query_seur <- Seurat::FindNeighbors(query_seur, dims = 1:50)
Computing nearest neighbor graph
Computing SNN
# Cluster the cells
query_seur <- Seurat::FindClusters(query_seur, resolution = 0.5)
Seurat::DimPlot(query_seur, reduction = "UMAP")
Now let’s extract the top marker genes, and see which ones correspond with each cluster. This can be done using the FindAllMarkers function within Seurat.
markers_seur <- Seurat::FindAllMarkers(query_seur, only.pos = TRUE)
require(dplyr)
# Retrieve the top 5 marker genes per cluster
# Use whichever genes have the highest values under the AVG_LOG column
top5 <- markers_seur %>% group_by(cluster) %>%
dplyr::slice_max(get(grep("^avg_log", colnames(markers_seur), value = TRUE)),
n = 5)
# Create the dot plot
Seurat::DotPlot(query_seur, features = unique(top5$gene)) +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
size = 8, hjust = 1)) +
Seurat::NoLegend()
# Create the heatmap
Seurat::DoHeatmap(query_seur, features = unique(top5$gene)) +
Seurat::NoLegend() +
ggplot2::theme(axis.text.y = ggplot2::element_text(size = 8))
The following features were omitted as they were not found in the scale.data slot for the RNA assay: NOSIP, CD3E, CD3D, IL7R, LDHB
Pathway analysis
Pathway analysis can also be done for each cluster to determine significantly up- and downregulated pathways based on known gene function. An easy way to do this is by feeding our current Seurat object into cerebroApp. cerebroApp requires that marker genes be fetched again through before performing simple pathway analysis.
# First get marker genes through cerebro
query_seur <- cerebroApp::getMarkerGenes(query_seur,
groups = c('seurat_clusters'),
assay = "RNA",
organism = "hg")
# Get enriched pathways through cerebro
query_seur <- cerebroApp::getEnrichedPathways(query_seur,
databases = c("GO_Biological_Process_2018",
"GO_Cellular_Component_2018",
"GO_Molecular_Function_2018",
"KEGG_2016",
"WikiPathways_2016",
"Reactome_2016",
"Panther_2016",
"Human_Gene_Atlas",
"Mouse_Gene_Atlas"),
adj_p_cutoff = 0.05,
max_terms = 100,
URL_API = "http://amp.pharm.mssm.edu/Enrichr/enrich")
生活很好,有你更好