Single-Cell Sequencing Data Integration Exercise (with Detailed Code)

I have recently been following the bioinformatics course taught by CSC – IT Center for Science, Finland (https://www.csc.fi/web/training/-/scrnaseq). The course site also provides exercise materials and detailed code, so today let's work through the single-cell data integration exercise, following the official code:

https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-integration/Data_Integration.md

This exercise uses two methods to integrate multiple single-cell sequencing datasets, then removes batch effects and quantitatively evaluates the quality of the integrated data. The datasets come from four platforms: CelSeq (GSE81076), CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). The raw matrix and associated metadata can be downloaded here. (Note that the matrix the authors uploaded has already been merged, but batch effects have not been removed; the code below splits this matrix back into 4 datasets and then integrates them.)

Before starting, load the R packages:

> library("Seurat")
> library("ggplot2")
> library("cowplot")
> library("scater")
> library("scran")
> library("BiocParallel")
> library("BiocNeighbors")

(1) Data integration and batch-effect correction with Seurat (anchors and CCA)

Load the expression matrix and metadata; the metadata contains the sequencing platform (tech column) and the cell type annotation (celltype column):

> pancreas.data <- readRDS(file = "pancreas_expression_matrix.rds")
> metadata <- readRDS(file = "pancreas_metadata.rds")

Take a look at this metadata:

Create a Seurat object:

> pancreas <- CreateSeuratObject(pancreas.data, meta.data = metadata)

Before any batch-effect correction, first inspect the dataset. We run the standard preprocessing (log-normalization), identify variable features ("vst"), then scale the merged data, run PCA, visualize, and cluster the merged cells.

# Normalize and find variable features
> pancreas <- NormalizeData(pancreas, verbose = FALSE)
> pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
# Run the standard workflow (visualization and clustering)
> pancreas <- ScaleData(pancreas, verbose = FALSE)
> pancreas <- RunPCA(pancreas, npcs = 30, verbose = FALSE)
> pancreas <- RunUMAP(pancreas, reduction = "pca", dims = 1:30)
> p1 <- DimPlot(pancreas, reduction = "umap", group.by = "tech")
> p2 <- DimPlot(pancreas, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + 
  NoLegend()
> plot_grid(p1, p2)
This is the merged data without batch-effect removal: in the left panel, cells from the 4 sequencing platforms barely overlap, and in the right panel the cells do not form clean clusters by cell type.
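As an aside, the "LogNormalize" method that NormalizeData applies by default is simple enough to sketch in a few lines of plain Python: divide each cell's counts by that cell's total, multiply by a scale factor (10,000 by default), then apply ln(1 + x). This is a toy illustration with made-up counts, not Seurat's internals:

```python
import math

def log_normalize(cell_counts, scale_factor=1e4):
    """Per-cell "LogNormalize": counts / cell total * scale_factor, then ln(1 + x)."""
    total = sum(cell_counts)
    return [math.log1p(c / total * scale_factor) for c in cell_counts]

cell = [10, 0, 90]          # raw counts for 3 genes in one cell (toy data)
norm = log_normalize(cell)  # zero counts stay zero; the rest move to a log scale
```

The division by the cell total removes differences in sequencing depth between cells, and the log transform tames the heavy right tail of count data.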

Next, the authors split the merged object into a list of 4 datasets, one per platform. Run the standard preprocessing (log-normalization) on each, and identify the variable features ("vst") of each dataset:

> pancreas.list <- SplitObject(pancreas, split.by = "tech")

> for (i in 1:length(pancreas.list)) {
  pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
  pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000, 
                                             verbose = FALSE)
}

Integrating the 4 pancreatic islet datasets
Identify anchors with the FindIntegrationAnchors function, which takes the list of Seurat objects as input:

> reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2", "fluidigmc1")]
> pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)

Computing 2000 integration features
Scaling features for provided objects
Finding all pairwise anchors
(repeated for each of the 6 dataset pairs: Running CCA, Merging objects,
 Finding neighborhoods, Finding anchors, Filtering anchors; progress bars omitted)
    Found 3499 anchors; retained 2821 after filtering
    Found 3515 anchors; retained 2701 after filtering
    Found 6173 anchors; retained 4634 after filtering
    Found 2176 anchors; retained 1841 after filtering
    Found 2774 anchors; retained 2478 after filtering
    Found 2723 anchors; retained 2410 after filtering

Then pass these anchors to the IntegrateData function, which returns a Seurat object:

> pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)

After running IntegrateData, the Seurat object contains a new Assay holding the integrated (i.e. "batch-corrected") expression matrix. Note that the original (uncorrected) values are still stored in the RNA assay of the Seurat object, so you can switch back and forth.

We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the results with UMAP. The integrated datasets cluster by cell type rather than by technology.

# Switch to the integrated assay
> DefaultAssay(pancreas.integrated) <- "integrated"

Run the standard workflow (visualization and clustering):

> pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
> pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
> pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
> p3 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
> p4 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + 
   NoLegend()
> plot_grid(p3, p4)
These are the plots after splitting, re-integrating, and removing batch effects. In the left panel, cells from the 4 platforms now overlap well; in the right panel, the clustering by cell type is also clean.

(2) Data integration with the Mutual Nearest Neighbors (MNN) method

You can create a SingleCellExperiment (SCE) object from a count matrix, or convert from a Seurat object to an SCE:

> celseq.data <- as.SingleCellExperiment(pancreas.list$celseq)
> celseq2.data <- as.SingleCellExperiment(pancreas.list$celseq2)
> fluidigmc1.data <- as.SingleCellExperiment(pancreas.list$fluidigmc1)
> smartseq2.data <- as.SingleCellExperiment(pancreas.list$smartseq2)

Find the genes shared by all datasets, and subset each dataset to those common genes:

> keep_genes <- Reduce(intersect, list(rownames(celseq.data),rownames(celseq2.data),
+                                      rownames(fluidigmc1.data),rownames(smartseq2.data)))
> celseq.data <- celseq.data[match(keep_genes, rownames(celseq.data)), ]
> celseq2.data <- celseq2.data[match(keep_genes, rownames(celseq2.data)), ]
> fluidigmc1.data <- fluidigmc1.data[match(keep_genes, rownames(fluidigmc1.data)), ]
> smartseq2.data <- smartseq2.data[match(keep_genes, rownames(smartseq2.data)), ]

Next, compute quality-control metrics with calculateQCMetrics(), and identify low-quality cells as outliers with abnormally low total counts or an abnormally low number of detected genes:

# Process celseq.data
> celseq.data <- calculateQCMetrics(celseq.data)
> low_lib_celseq.data <- isOutlier(celseq.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq.data <- isOutlier(celseq.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq.data <- celseq.data[, !(low_lib_celseq.data | low_genes_celseq.data)]
# Process celseq2.data
> celseq2.data <- calculateQCMetrics(celseq2.data)
> low_lib_celseq2.data <- isOutlier(celseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq2.data <- isOutlier(celseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq2.data <- celseq2.data[, !(low_lib_celseq2.data | low_genes_celseq2.data)]
# Process fluidigmc1.data
> fluidigmc1.data <- calculateQCMetrics(fluidigmc1.data)
> low_lib_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_counts, type="lower", nmad=3)
> low_genes_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_features_by_counts, type="lower", nmad=3)
> fluidigmc1.data <- fluidigmc1.data[, !(low_lib_fluidigmc1.data | low_genes_fluidigmc1.data)]
# Process smartseq2.data
> smartseq2.data <- calculateQCMetrics(smartseq2.data)
> low_lib_smartseq2.data <- isOutlier(smartseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_smartseq2.data <- isOutlier(smartseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> smartseq2.data <- smartseq2.data[, !(low_lib_smartseq2.data | low_genes_smartseq2.data)]
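The outlier rule behind isOutlier(type = "lower", nmad = 3) is just a median/MAD cutoff: flag cells more than 3 median absolute deviations below the median. A minimal Python sketch on made-up log-scale QC values (an illustration of the idea, not scater's implementation):

```python
import statistics

def is_lower_outlier(values, nmads=3):
    """Flag values more than nmads median-absolute-deviations below the median
    (the idea behind scater's isOutlier(type = "lower"))."""
    med = statistics.median(values)
    # Scale MAD by 1.4826 so it is consistent with the SD under normality
    mad = statistics.median(abs(v - med) for v in values) * 1.4826
    lower = med - nmads * mad
    return [v < lower for v in values]

log_counts = [3.1, 3.0, 3.2, 2.9, 3.05, 1.2]  # one clearly low-quality cell
flags = is_lower_outlier(log_counts)
```

Using the median and MAD rather than mean and SD keeps the threshold itself robust to the very outliers being detected.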

Then normalize the data by computing size factors with computeSumFactors() and applying scran's normalize() function:

# Compute sizefactors
> celseq.data <- computeSumFactors(celseq.data)
> celseq2.data <- computeSumFactors(celseq2.data)
> fluidigmc1.data <- computeSumFactors(fluidigmc1.data)
> smartseq2.data <- computeSumFactors(smartseq2.data)
# Normalize
> celseq.data <- normalize(celseq.data)
> celseq2.data <- normalize(celseq2.data)
> fluidigmc1.data <- normalize(fluidigmc1.data)
> smartseq2.data <- normalize(smartseq2.data)
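The role of a size factor is easier to see with a toy example. The sketch below (plain Python, made-up totals) derives the simplest possible size factors from library sizes, scaled to average 1. scran's computeSumFactors() instead uses pooling-based deconvolution, which is more robust to differential expression, but the factor is used the same way: divide each cell's counts by its size factor before log-transforming.

```python
def library_size_factors(totals):
    """Naive size factors: each cell's library size relative to the mean library size.
    (A simplification; scran's computeSumFactors uses pooled deconvolution.)"""
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]

totals = [1000, 2000, 3000]            # total counts per cell (toy data)
sf = library_size_factors(totals)      # normalized value = count / size factor
```

A cell sequenced twice as deeply gets a size factor twice as large, so its normalized expression values become comparable to the other cells'.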

Feature (gene) selection: use trendVar() and decomposeVar() to estimate each gene's variance and decompose it into a technical and a biological component:

# celseq.data
> fit_celseq.data <- trendVar(celseq.data, use.spikes=FALSE) 
> dec_celseq.data <- decomposeVar(celseq.data, fit_celseq.data)
> dec_celseq.data$Symbol_TENx <- rowData(celseq.data)$Symbol_TENx
> dec_celseq.data <- dec_celseq.data[order(dec_celseq.data$bio, decreasing = TRUE), ]
# celseq2.data
> fit_celseq2.data <- trendVar(celseq2.data, use.spikes=FALSE) 
> dec_celseq2.data <- decomposeVar(celseq2.data, fit_celseq2.data)
> dec_celseq2.data$Symbol_TENx <- rowData(celseq2.data)$Symbol_TENx
> dec_celseq2.data <- dec_celseq2.data[order(dec_celseq2.data$bio, decreasing = TRUE), ]
# fluidigmc1.data
> fit_fluidigmc1.data <- trendVar(fluidigmc1.data, use.spikes=FALSE) 
> dec_fluidigmc1.data <- decomposeVar(fluidigmc1.data, fit_fluidigmc1.data)
> dec_fluidigmc1.data$Symbol_TENx <- rowData(fluidigmc1.data)$Symbol_TENx
> dec_fluidigmc1.data <- dec_fluidigmc1.data[order(dec_fluidigmc1.data$bio, decreasing = TRUE), ]
# smartseq2.data
> fit_smartseq2.data <- trendVar(smartseq2.data, use.spikes=FALSE) 
> dec_smartseq2.data <- decomposeVar(smartseq2.data, fit_smartseq2.data)
> dec_smartseq2.data$Symbol_TENx <- rowData(smartseq2.data)$Symbol_TENx
> dec_smartseq2.data <- dec_smartseq2.data[order(dec_smartseq2.data$bio, decreasing = TRUE), ]
# Select the most informative genes, i.e. those expressed in all datasets
> universe <- Reduce(intersect, list(rownames(dec_celseq.data),rownames(dec_celseq2.data),
                                   rownames(dec_fluidigmc1.data),rownames(dec_smartseq2.data)))
> mean.bio <- (dec_celseq.data[universe,"bio"] + dec_celseq2.data[universe,"bio"] + 
                dec_fluidigmc1.data[universe,"bio"] + dec_smartseq2.data[universe,"bio"])/4
> hvg_genes <- universe[mean.bio > 0]
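The gene-selection step above reduces to: average each gene's biological variance component over the four batches and keep the genes where that average is positive. A toy Python sketch with hypothetical gene names and values:

```python
def select_hvgs(bio_by_batch):
    """Keep genes whose biological variance component, averaged across batches,
    is positive -- mirroring the mean.bio > 0 step.
    bio_by_batch: {gene: [bio component in each batch]}"""
    return [g for g, bios in bio_by_batch.items()
            if sum(bios) / len(bios) > 0]

bio = {"INS":  [2.0, 1.5, 1.8, 2.2],      # variance well above the technical trend
       "ACTB": [-0.1, 0.05, -0.2, 0.0]}   # variance explained by the technical trend
hvgs = select_hvgs(bio)
```

A positive biological component means the gene varies more than the mean-variance trend predicts from technical noise alone, so it is informative for distinguishing cell populations.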

Combine the datasets into a single SingleCellExperiment:

# Combine the raw counts
> counts_pancreas <- cbind(counts(celseq.data), counts(celseq2.data), 
                          counts(fluidigmc1.data), counts(smartseq2.data))
# Combine the normalized log-counts (with multibatch normalization)
> logcounts_pancreas <- cbind(logcounts(celseq.data), logcounts(celseq2.data), 
                             logcounts(fluidigmc1.data), logcounts(smartseq2.data))
# Build the SCE object for the combined data
> sce <- SingleCellExperiment( 
   assays = list(counts = counts_pancreas, logcounts = logcounts_pancreas),  
   rowData = rowData(celseq.data), # identical across the four datasets after subsetting to shared genes
   colData = rbind(colData(celseq.data), colData(celseq2.data), 
                   colData(fluidigmc1.data), colData(smartseq2.data)) 
 )
# Store the hvg_genes from above in the metadata slot of the sce object
> metadata(sce)$hvg_genes <- hvg_genes

Before correcting batch effects with MNN, take a look at the datasets:

> sce <- runPCA(sce,
               ncomponents = 20,
               feature_set = hvg_genes,
               method = "irlba")
> 
> names(reducedDims(sce)) <- "PCA_naive" 
> 
> p5 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "tech") + 
   ggtitle("PCA Without batch correction")
> p6 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "celltype") + 
   ggtitle("PCA Without batch correction")
> plot_grid(p5, p6)
Before batch-effect removal

Use fastMNN() to correct batch effects. Before running fastMNN(), we rescale each batch to adjust for differences in sequencing depth between batches. multiBatchNorm() from the scran package adjusts the size factors and recomputes the log-normalized expression values to account for systematic differences between the SingleCellExperiment objects. The earlier size factors only remove biases between cells within a single batch; removing technical differences between batches improves the quality of the correction:

> rescaled <- multiBatchNorm(celseq.data, celseq2.data, fluidigmc1.data, smartseq2.data) 
> celseq.data_rescaled <- rescaled[[1]]
> celseq2.data_rescaled <- rescaled[[2]]
> fluidigmc1.data_rescaled <- rescaled[[3]]
> smartseq2.data_rescaled <- rescaled[[4]]

Run fastMNN, and store the low-dimensional MNN representation in the reducedDims slot of the sce object:

> mnn_out <- fastMNN(celseq.data_rescaled, 
                   celseq2.data_rescaled,
                   fluidigmc1.data_rescaled,
                   smartseq2.data_rescaled,
                   subset.row = metadata(sce)$hvg_genes,
                   k = 20, d = 50, approximate = TRUE,
                   # BPPARAM = BiocParallel::MulticoreParam(8),
                   BNPARAM = BiocNeighbors::AnnoyParam())

> reducedDim(sce, "MNN") <- mnn_out$corrected

Note that fastMNN() does not produce a batch-corrected expression matrix. Its output is only a low-dimensional representation, suitable for plotting, t-SNE/UMAP, clustering, and trajectory analysis.
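For intuition, the core of any MNN method is finding pairs of cells in two batches that are each other's nearest cross-batch neighbors; the batch-correction vector is then estimated from those pairs. A toy Python sketch with k = 1 and two made-up 2-D batches (fastMNN actually works in a PCA space with k nearest neighbors and smoothed correction vectors, so this shows only the pairing idea):

```python
def nearest(point, batch):
    """Index of the cell in `batch` closest (squared Euclidean) to `point`."""
    return min(range(len(batch)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, batch[j])))

def mnn_pairs(batch1, batch2):
    """Pairs (i, j) where cell i in batch1 and cell j in batch2 are each
    other's nearest neighbor across batches (k = 1 for simplicity)."""
    return [(i, j) for i in range(len(batch1))
            for j in range(len(batch2))
            if nearest(batch1[i], batch2) == j and nearest(batch2[j], batch1) == i]

b1 = [(0.0, 0.0), (5.0, 5.0)]   # two cell populations in batch 1 (toy data)
b2 = [(0.5, 0.2), (5.2, 4.9)]   # the same populations in batch 2, slightly shifted
pairs = mnn_pairs(b1, b2)
```

The key assumption is that mutual nearest neighbors across batches are cells of the same type, so the average displacement between paired cells estimates the batch effect rather than a biological difference.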
Plot the batch-corrected results:

> p7 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "tech") + ggtitle("MNN Output Reduced Dimensions")
> p8 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "celltype") + ggtitle("MNN Output Reduced Dimensions")
> plot_grid(p7, p8)
Plots after batch-effect removal
© Copyright belongs to the author. Reposting is prohibited; for reposting or content collaboration, please contact the author.