标准工作流程
在这个例子工作流中,我们演示了我们最近在论文中引入的两种新方法,Comprehensive Integration of Single Cell Data:
- 将多个不同的scna -seq数据集组装到一个参考数据集中
- 将细胞类型标签从参考数据集转移到新的查询数据集
在本例中,我们选择了通过四种技术(CelSeq (GSE81076) CelSeq2 (GSE85241)、Fluidigm C1 (GSE86469)和SMART-Seq2 (E-MTAB-5061)生成的人类胰岛细胞数据集。为了方便起见,我们通过SeuratData包分发这个数据集。
新方法的代码在Seurat v3中实现。您可以使用install.packages从CRAN下载和安装。
除了新的方法之外,Seurat v3还包含了许多旨在改进Seurat对象和用户交互的改进。为了帮助用户熟悉这些更改,我们为常见任务编写了一个命令备忘单( command cheat sheet)。
载入包和数据:
library(Seurat)
packageVersion("Seurat")
>[1] ‘3.1.0’
library(SeuratData)
#InstallData("pbmc3k")
#InstallData("panc8")
#install.packages("C:/Users/Administrator/Desktop/panc8.SeuratData_3.0.2.tar.gz", repos = NULL, type = "source")
#install.packages("C:/Users/Administrator/Desktop/pbmcsca.SeuratData_3.0.0.tar.gz", repos = NULL, type = "source")
> library(panc8.SeuratData)
> data("panc8")
> panc8
An object of class Seurat
34363 features across 14890 samples within 1 assay
Active assay: RNA (34363 features)
> summary(panc8)
Length Class Mode
1 Seurat S4
> citation('panc8.SeuratData')
在出版物中使用程序包时引用‘panc8.SeuratData’:
Satija Lab (2019). panc8.SeuratData: Eight Pancreas
Datasets Across Five Technologies. R package version
3.0.2.
A BibTeX entry for LaTeX users is
@Manual{,
title = {panc8.SeuratData: Eight Pancreas Datasets Across Five Technologies},
author = {Satija Lab},
year = {2019},
note = {R package version 3.0.2},
}
Warning message:
In citation("panc8.SeuratData") :
程序包‘panc8.SeuratData’里的DESCRIPTION文件中没有日期这一域
> library(panc8.SeuratData)
> data("panc8")
> panc8
An object of class Seurat
34363 features across 14890 samples within 1 assay
Active assay: RNA (34363 features)
> summary(panc8)
Length Class Mode
1 Seurat S4
> citation('panc8.SeuratData')
在出版物中使用程序包时引用‘panc8.SeuratData’:
Satija Lab (2019). panc8.SeuratData: Eight Pancreas
Datasets Across Five Technologies. R package version
3.0.2.
A BibTeX entry for LaTeX users is
@Manual{,
title = {panc8.SeuratData: Eight Pancreas Datasets Across Five Technologies},
author = {Satija Lab},
year = {2019},
note = {R package version 3.0.2},
}
Warning message:
In citation("panc8.SeuratData") :
程序包‘panc8.SeuratData’里的DESCRIPTION文件中没有日期这一域
要构造参考数据集,我们将在各个数据集之间标识“锚”。首先,我们将组合的对象拆分为一个列表,每个数据集作为一个元素。
> pancreas.list <- SplitObject(panc8, split.by = "tech")
> summary(pancreas.list)
Length Class Mode
celseq 1 Seurat S4
celseq2 1 Seurat S4
smartseq2 1 Seurat S4
fluidigmc1 1 Seurat S4
indrop 1 Seurat S4
在找到锚之前,我们执行标准的预处理(log-normalization),并为每个锚分别识别变量特性。注意,Seurat v3实现了一种改进的基于方差稳定转换(“vst”)的变量特征选择方法。
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000, verbose = FALSE)
}
整合3个胰岛细胞数据集
接下来,我们使用findintegrationanchor函数来标识锚,该函数接受Seurat对象的列表(list)作为输入。在这里,我们将三个对象集成到一个参考数据集中(稍后我们将使用第四个对象)。
> reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
> pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3514 anchors
Filtering anchors
Retained 2761 anchors
Extracting within-dataset neighbors
|+++++++++++++++++ | 33% ~25s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3500 anchors
Filtering anchors
Retained 2728 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++ | 67% ~12s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 6174 anchors
Filtering anchors
Retained 4561 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 49s
我们在这里使用所有默认参数来标识锚,包括数据集的“维数”(30;您可以尝试在较大范围内更改此参数,例如在10到50之间)。
然后我们将这些锚传递给IntegrateData函数,该函数返回一个Seurat对象。
返回的对象将包含一个新的Assay,其中包含一个针对所有细胞的完整(或“批量校正”)表达矩阵,使它们能够被联合分析。
> pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 3 into 2 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
> head(pancreas.integrated@meta.data)
orig.ident nCount_RNA nFeature_RNA tech replicate
D101_5 D101 4615.810 1986 celseq celseq
D101_7 D101 29001.563 4209 celseq celseq
D101_10 D101 6707.857 2408 celseq celseq
D101_13 D101 8797.224 2964 celseq celseq
D101_14 D101 5032.558 2264 celseq celseq
D101_17 D101 13474.866 3982 celseq celseq
assigned_cluster celltype dataset
D101_5 <NA> gamma celseq
D101_7 <NA> acinar celseq
D101_10 <NA> alpha celseq
D101_13 <NA> delta celseq
D101_14 <NA> beta celseq
D101_17 <NA> ductal celseq
运行IntegrateData后,Seurat对象将包含一个使用集成表达矩阵的Assay 。注意,原始值(未更正的值)仍然存储在“RNA”测试中的对象中,因此您可以来回切换。
然后我们可以使用这个新的矩阵进行下游分析和可视化。在这里,我们缩放集成的数据,运行PCA,并使用UMAP可视化结果。集成的数据集按细胞类型聚类,而不是按技术分群。
library(ggplot2)
library(cowplot)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"
# Run the standard workflow for visualization and clustering
?ScaleData
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
20:55:25 Read 5683 rows and found 30 numeric columns
20:55:25 Using Annoy for neighbor search, n_neighbors = 30
20:55:26 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:55:28 Writing NN index file to temp file C:\Users\ADMINI~1\AppData\Local\Temp\Rtmpo5X81p\file2870430b3367
20:55:28 Searching Annoy index using 1 thread, search_k = 3000
20:55:30 Annoy recall = 100%
20:55:31 Commencing smooth kNN distance calibration using 1 thread
20:55:31 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
20:55:31 Initializing from PCA
20:55:31 PCA: 2 components explained 44.22% variance
20:55:32 Commencing optimization for 500 epochs, with 252460 positive edges
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:55:59 Optimization finished
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE,
repel = TRUE) + NoLegend()
plot_grid(p1, p2)
使用参考数据集对细胞类型进行分类
Seurat v3还支持将参考数据(或元数据)投影到查询对象上。虽然许多方法是一致的(这两个过程都是从识别锚开始的),但数据映射(data transfer)和整合之间有两个重要的区别:
在数据映射中,Seurat不纠正或修改查询表达式数据。
在数据映射中,Seurat有一个选项(默认设置)来将参考的PCA结构投射到查询上,而不是学习与CCA的联合结构。我们通常建议在scna -seq数据集之间投影数据时使用此选项。
找到锚之后,我们使用TransferData函数根据参考数据(参考数据的细胞类型标签向量)对查询数据集的细胞进行分类。TransferData返回一个带有预测id和预测分数的矩阵,我们可以将其添加到查询元数据中(query metadata)
> pancreas.query <- pancreas.list[["fluidigmc1"]]
> pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query,
+ dims = 1:30)
Performing PCA on the provided reference using 2000 features as input.
Projecting PCA
Finding neighborhoods
Finding anchors
Found 953 anchors
Filtering anchors
Retained 885 anchors
Extracting within-dataset neighbors
> predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype,
+ dims = 1:30)
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predicting cell labels
> pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)
因为我们有来自完整集成分析的原始标签注释,所以我们可以评估预测的细胞类型注释与完整引用的匹配程度。在这个例子中,我们发现在细胞类型分类上有很高的一致性,超过97%的细胞被正确标记。
pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)
FALSE TRUE
18 620
为了进一步验证这一点,我们可以检查特定胰岛细胞群的一些典型细胞类型标记(cell type markers)。请注意,即使这些细胞类型中的一些只由一两个细胞(例如epsilon细胞),我们仍然能够正确地对它们进行分类。
> table(pancreas.query$predicted.id)
acinar activated_stellate alpha beta delta
21 17 248 258 22
ductal endothelial epsilon gamma macrophage
33 13 1 17 1
mast schwann
2 5
> VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")
https://satijalab.org/seurat/v3.2/integration.html
https://www.jianshu.com/p/d43f16bdfed9