Seurat-多组数据整合分析标准工作流程Integration and Label Transfer

标准工作流程

在这个例子工作流中，我们演示了我们最近在论文中引入的两种新方法，Comprehensive Integration of Single Cell Data:

将多个不同的scna -seq数据集组装到一个参考数据集中
将细胞类型标签从参考数据集转移到新的查询数据集

在本例中，我们选择了通过四种技术(CelSeq (GSE81076) CelSeq2 (GSE85241)、Fluidigm C1 (GSE86469)和SMART-Seq2 (E-MTAB-5061)生成的人类胰岛细胞数据集。为了方便起见，我们通过SeuratData包分发这个数据集。

新方法的代码在Seurat v3中实现。您可以使用install.packages从CRAN下载和安装。

除了新的方法之外，Seurat v3还包含了许多旨在改进Seurat对象和用户交互的改进。为了帮助用户熟悉这些更改，我们为常见任务编写了一个命令备忘单（ command cheat sheet）。

载入包和数据：

library(Seurat)
packageVersion("Seurat")

>[1] ‘3.1.0’

library(SeuratData)
#InstallData("pbmc3k")
#InstallData("panc8")
#install.packages("C:/Users/Administrator/Desktop/panc8.SeuratData_3.0.2.tar.gz", repos = NULL, type = "source")
#install.packages("C:/Users/Administrator/Desktop/pbmcsca.SeuratData_3.0.0.tar.gz", repos = NULL, type = "source")
> library(panc8.SeuratData)
> data("panc8")
> panc8
An object of class Seurat 
34363 features across 14890 samples within 1 assay 
Active assay: RNA (34363 features)
> summary(panc8)
Length  Class   Mode 
     1 Seurat     S4 
> citation('panc8.SeuratData')

在出版物中使用程序包时引用‘panc8.SeuratData’:

  Satija Lab (2019). panc8.SeuratData: Eight Pancreas
  Datasets Across Five Technologies. R package version
  3.0.2.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {panc8.SeuratData: Eight Pancreas Datasets Across Five Technologies},
    author = {Satija Lab},
    year = {2019},
    note = {R package version 3.0.2},
  }

Warning message:
In citation("panc8.SeuratData") :
  程序包‘panc8.SeuratData’里的DESCRIPTION文件中没有日期这一域
> library(panc8.SeuratData)
> data("panc8")
> panc8
An object of class Seurat 
34363 features across 14890 samples within 1 assay 
Active assay: RNA (34363 features)
> summary(panc8)
Length  Class   Mode 
     1 Seurat     S4 
> citation('panc8.SeuratData')

在出版物中使用程序包时引用‘panc8.SeuratData’:

  Satija Lab (2019). panc8.SeuratData: Eight Pancreas
  Datasets Across Five Technologies. R package version
  3.0.2.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {panc8.SeuratData: Eight Pancreas Datasets Across Five Technologies},
    author = {Satija Lab},
    year = {2019},
    note = {R package version 3.0.2},
  }

Warning message:
In citation("panc8.SeuratData") :
  程序包‘panc8.SeuratData’里的DESCRIPTION文件中没有日期这一域

要构造参考数据集，我们将在各个数据集之间标识“锚”。首先，我们将组合的对象拆分为一个列表，每个数据集作为一个元素。

> pancreas.list <- SplitObject(panc8, split.by = "tech")
> summary(pancreas.list)
           Length Class  Mode
celseq     1      Seurat S4  
celseq2    1      Seurat S4  
smartseq2  1      Seurat S4  
fluidigmc1 1      Seurat S4  
indrop     1      Seurat S4

在找到锚之前，我们执行标准的预处理(log-normalization)，并为每个锚分别识别变量特性。注意，Seurat v3实现了一种改进的基于方差稳定转换(“vst”)的变量特征选择方法。

pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
for (i in 1:length(pancreas.list)) {
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst",  nfeatures = 2000, verbose = FALSE)
}

整合3个胰岛细胞数据集

接下来，我们使用findintegrationanchor函数来标识锚，该函数接受Seurat对象的列表（list）作为输入。在这里，我们将三个对象集成到一个参考数据集中(稍后我们将使用第四个对象)。

> reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
> pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s
Finding all pairwise anchors
  |                                                  | 0 % ~calculating  Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 3514 anchors
Filtering anchors
    Retained 2761 anchors
Extracting within-dataset neighbors
  |+++++++++++++++++                                 | 33% ~25s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 3500 anchors
Filtering anchors
    Retained 2728 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++                | 67% ~12s          Running CCA
Merging objects
Finding neighborhoods
Finding anchors
    Found 6174 anchors
Filtering anchors
    Retained 4561 anchors
Extracting within-dataset neighbors
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 49s

我们在这里使用所有默认参数来标识锚，包括数据集的“维数”(30;您可以尝试在较大范围内更改此参数，例如在10到50之间)。

然后我们将这些锚传递给IntegrateData函数，该函数返回一个Seurat对象。
返回的对象将包含一个新的Assay，其中包含一个针对所有细胞的完整(或“批量校正”)表达矩阵，使它们能够被联合分析。

> pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
Merging dataset 1 into 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 3 into 2 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
> head(pancreas.integrated@meta.data)
        orig.ident nCount_RNA nFeature_RNA   tech replicate
D101_5        D101   4615.810         1986 celseq    celseq
D101_7        D101  29001.563         4209 celseq    celseq
D101_10       D101   6707.857         2408 celseq    celseq
D101_13       D101   8797.224         2964 celseq    celseq
D101_14       D101   5032.558         2264 celseq    celseq
D101_17       D101  13474.866         3982 celseq    celseq
        assigned_cluster celltype dataset
D101_5              <NA>    gamma  celseq
D101_7              <NA>   acinar  celseq
D101_10             <NA>    alpha  celseq
D101_13             <NA>    delta  celseq
D101_14             <NA>     beta  celseq
D101_17             <NA>   ductal  celseq

运行IntegrateData后，Seurat对象将包含一个使用集成表达矩阵的Assay 。注意，原始值(未更正的值)仍然存储在“RNA”测试中的对象中，因此您可以来回切换。
然后我们可以使用这个新的矩阵进行下游分析和可视化。在这里，我们缩放集成的数据，运行PCA，并使用UMAP可视化结果。集成的数据集按细胞类型聚类，而不是按技术分群。

library(ggplot2)
library(cowplot)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"

# Run the standard workflow for visualization and clustering
?ScaleData
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)

pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
20:55:25 Read 5683 rows and found 30 numeric columns
20:55:25 Using Annoy for neighbor search, n_neighbors = 30
20:55:26 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:55:28 Writing NN index file to temp file C:\Users\ADMINI~1\AppData\Local\Temp\Rtmpo5X81p\file2870430b3367
20:55:28 Searching Annoy index using 1 thread, search_k = 3000
20:55:30 Annoy recall = 100%
20:55:31 Commencing smooth kNN distance calibration using 1 thread
20:55:31 Found 2 connected components, falling back to 'spca' initialization with init_sdev = 1
20:55:31 Initializing from PCA
20:55:31 PCA: 2 components explained 44.22% variance
20:55:32 Commencing optimization for 500 epochs, with 252460 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
20:55:59 Optimization finished
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, 
              repel = TRUE) + NoLegend()
plot_grid(p1, p2)

image.png

使用参考数据集对细胞类型进行分类

Seurat v3还支持将参考数据(或元数据)投影到查询对象上。虽然许多方法是一致的(这两个过程都是从识别锚开始的)，但数据映射（data transfer）和整合之间有两个重要的区别:

在数据映射中，Seurat不纠正或修改查询表达式数据。
在数据映射中，Seurat有一个选项(默认设置)来将参考的PCA结构投射到查询上，而不是学习与CCA的联合结构。我们通常建议在scna -seq数据集之间投影数据时使用此选项。

找到锚之后，我们使用TransferData函数根据参考数据(参考数据的细胞类型标签向量)对查询数据集的细胞进行分类。TransferData返回一个带有预测id和预测分数的矩阵，我们可以将其添加到查询元数据中（query metadata）

> pancreas.query <- pancreas.list[["fluidigmc1"]]
> pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query, 
+                                         dims = 1:30)
Performing PCA on the provided reference using 2000 features as input.
Projecting PCA
Finding neighborhoods
Finding anchors
    Found 953 anchors
Filtering anchors
    Retained 885 anchors
Extracting within-dataset neighbors
> predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype, 
+                             dims = 1:30)
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Predicting cell labels
> pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)

因为我们有来自完整集成分析的原始标签注释，所以我们可以评估预测的细胞类型注释与完整引用的匹配程度。在这个例子中，我们发现在细胞类型分类上有很高的一致性，超过97%的细胞被正确标记。

pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)

FALSE  TRUE 
   18   620

为了进一步验证这一点，我们可以检查特定胰岛细胞群的一些典型细胞类型标记（cell type markers）。请注意，即使这些细胞类型中的一些只由一两个细胞(例如epsilon细胞)，我们仍然能够正确地对它们进行分类。

> table(pancreas.query$predicted.id)

            acinar activated_stellate              alpha               beta              delta 
                21                 17                248                258                 22 
            ductal        endothelial            epsilon              gamma         macrophage 
                33                 13                  1                 17                  1 
              mast            schwann 
                 2                  5 
> VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")

image.png

https://satijalab.org/seurat/v3.2/integration.html
https://www.jianshu.com/p/d43f16bdfed9

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 204,684评论 6赞 478
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 87,143评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 151,214评论 0赞 337
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,788评论 1赞 277
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,796评论 5赞 368
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,665评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,027评论 3赞 399
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,679评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 41,346评论 1赞 299
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,664评论 2赞 321
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,766评论 1赞 331
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,412评论 4赞 321
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,015评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,974评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,203评论 1赞 260
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,073评论 2赞 350
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,501评论 2赞 343

Seurat-多组数据整合分析标准工作流程Integration and Label Transfer

标准工作流程

推荐阅读更多精彩内容