接前面的流程,我们使用FindIntegrationAnchors()函数识别锚点,该函数将Seurat对象列表作为输入,并使用IntegratedData()函数,用这些锚点将两个数据集集成在一起。
ifnb.list
# $CTRL
# An object of class Seurat
# 14053 features across 6548 samples within 1 assay
# Active assay: RNA (14053 features, 2000 variable features)
#
# $STIM
# An object of class Seurat
# 14053 features across 7451 samples within 1 assay
# Active assay: RNA (14053 features, 2000 variable features)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)
# this command creates an 'integrated' data assay
immune.combined <- IntegrateData(anchorset = immune.anchors)
先跟着FindIntegrationAnchors()函数过一遍代码,看看如何实现的。
step1:传参:seurat对象和高可变基因传入object.list和anchor.features参数,其余均为默认参数。看下seurat官网对这个函数参数的解释。
object.list = ifnb.list
anchor.features = features
assay = NULL
reference = NULL
scale = TRUE
normalization.method = "LogNormalize"
sct.clip.range = NULL
reduction = "cca"
l2.norm = TRUE
dims = 1:30
k.anchor = 5
k.filter = 200
k.score = 30
max.features = 200
nn.method = "annoy"
n.trees = 50
eps = 0
verbose = TRUE
step2:并行运算:FindIntegrationAnchors()函数是可以做并行运算的,先暂时忽略此步。详见https://satijalab.org/seurat/articles/future_vignette.html。
my.lapply <- ifelse(
test = verbose && future::nbrOfWorkers() == 1,
yes = pbapply::pblapply,
no = future.apply::future_lapply
)
step3:数据校验:查看每个样本的细胞数目,默认是30维(dims = 1:30),每个样本的细胞数目要大于dims值。另外,确保每个样本seurat对象的Integration插槽为空,样本的barcode信息不重复。
object.ncells <- sapply(X = object.list, FUN = function(x) dim(x = x)[2])
object.ncells
# CTRL STIM
# 6548 7451
assay <- sapply(X = object.list, FUN = DefaultAssay)
assay
# CTRL STIM
# "RNA" "RNA"
# check tool
object.list <- lapply(
X = object.list,
FUN = function (obj) {
slot(object = obj, name = "tools")$Integration <- NULL
return(obj)
})
object.list <- CheckDuplicateCellNames(object.list = object.list)
step4:基因表达值缩放:对每个样本seurat对象的高可变基因进行scale缩放;
slot <- "data"
if (scale) {
if (verbose) {
message("Scaling features for provided objects")
}
object.list <- my.lapply(
X = object.list,
FUN = function(object) {
ScaleData(object = object, features = anchor.features, verbose = FALSE)
}
)
}
# 查看scale缩放后的矩阵
head(object.list[['CTRL']]@assays$RNA@scale.data[anchor.features,1:5])
step5:采用cca降维方法,计算两个样本有几种组合的方式combinations,expand.grid()函数返回几个元素所有可能的组合,2个样本有1种组合。确定锚点索引的偏移量,当前样本的偏移量为前面样本细胞数目的累加值。
比如有两个样本的细胞数为6548,7451 ,它们的偏移量为前面样本数目的累加值 0,6548。第一个样本的偏移量为0。
nn.reduction <- reduction
# determine pairwise combinations
combinations <- expand.grid(1:length(x = object.list), 1:length(x = object.list))
combinations <- combinations[combinations$Var1 < combinations$Var2, , drop = FALSE]
combinations
# Var1 Var2
# 3 1 2
# determine the proper offsets for indexing anchors
objects.ncell <- sapply(X = object.list, FUN = ncol)
objects.ncell
# CTRL STIM
# 6548 7451
offsets <- as.vector(x = cumsum(x = c(0, objects.ncell)))[1:length(x = object.list)]
offsets
# [1] 0 6548
step6:因为无reference,寻找成对的锚点。对每个样本间组合调用核心的锚点函数anchoring.fxn。anchoring.fxn函数较复杂,又会调用RunCCA和FindAnchors等函数。
我们的样本就一组样本组合(1-2),通过这个实操逐步分解代码,进入到anchoring.fxn函数。
step6-1:object.1和object.2分别为两个样本只包含高可变基因的seurat对象,包含RNA插槽的数据。重新生成ToIntegrate插槽,包含归一化和缩放的数据。
object.1 <- DietSeurat(
object = object.list[[i]],
assays = assay[i],
features = anchor.features,
counts = FALSE,
scale.data = TRUE,
dimreducs = reduction
)
object.2 <- DietSeurat(
object = object.list[[j]],
assays = assay[j],
features = anchor.features,
counts = FALSE,
scale.data = TRUE,
dimreducs = reduction
)
head(object.1@assays$RNA@data[1:5,1:5])
head(object.1@assays$RNA@scale.data[1:5,1:5])
# suppress key duplication warning
suppressWarnings(object.1[["ToIntegrate"]] <- object.1[[assay[i]]])
DefaultAssay(object = object.1) <- "ToIntegrate"
object.1 <- DietSeurat(object = object.1, assays = "ToIntegrate", scale.data = TRUE, dimreducs = reduction)
suppressWarnings(object.2[["ToIntegrate"]] <- object.2[[assay[j]]])
DefaultAssay(object = object.2) <- "ToIntegrate"
object.2 <- DietSeurat(object = object.2, assays = "ToIntegrate", scale.data = TRUE, dimreducs = reduction)
head(object.1@assays$ToIntegrate@data[1:5,1:5])
head(object.1@assays$ToIntegrate@scale.data[1:5,1:5])
step6-2:针对cca,pca,lsi三种降维方式,查找配对的锚点,此步骤调用RunCCA,将前面生成的object.1,object.2传入该函数,我们只关注cca降维。
object.pair <- switch(
EXPR = reduction,
'cca' = {
object.pair <- RunCCA(
object1 = object.1,
object2 = object.2,
assay1 = "ToIntegrate",
assay2 = "ToIntegrate",
features = anchor.features,
num.cc = max(dims),
renormalize = FALSE,
rescale = FALSE,
verbose = verbose
)
if (l2.norm){
object.pair <- L2Dim(object = object.pair, reduction = reduction)
reduction <- paste0(reduction, ".l2")
nn.reduction <- reduction
}
reduction.2 <- character()
object.pair
},
'pca' = {... },
'lsi' = {... },
stop("Invalid reduction parameter. Please choose either cca, rpca, or rlsi")
)
step6-3:进入到RunCCA函数,由于RunCCA函数为S3泛型函数,object.1和object.2均为seurat对象,先进入RunCCA.Seurat函数。
RunCCA.Seurat传入的参数
object1 = object.1
object2 = object.2
assay1 = "ToIntegrate"
assay2 = "ToIntegrate"
features = anchor.features
num.cc = max(dims),
renormalize = FALSE
rescale = FALSE
compute.gene.loadings = TRUE
add.cell.id1 = NULL
add.cell.id2 = NULL
rescale = FALSE
verbose = TRUE
RunCCA.Seurat代码比较好理解,从object.1和object.2提取高可变基因的缩放后的表达矩阵,生成data1和data2。此处有个细节,CheckFeatures函数使高可变基因由2000个,变成了1866个。
# ToIntegrate
assay1 <- assay1 %||% DefaultAssay(object = object1)
assay2 <- assay2 %||% DefaultAssay(object = object2)
nfeatures <- length(x = features)
if (!(rescale)) {
data.use1 <- GetAssayData(object = object1, assay = assay1, slot = "scale.data")
data.use2 <- GetAssayData(object = object2, assay = assay2, slot = "scale.data")
features <- CheckFeatures(data.use = data.use1, features = features, object.name = "object1", verbose = FALSE)
length(features)
# [1] 1938
features <- CheckFeatures(data.use = data.use2, features = features, object.name = "object2", verbose = FALSE)
length(features)
# [1] 1866
data1 <- data.use1[features, ]
data2 <- data.use2[features, ]
}
cca.results <- RunCCA(
object1 = data1,
object2 = data2,
standardize = TRUE,
num.cc = num.cc,
verbose = verbose,
)
问题1:CheckFeatures函数做了什么操作,为什么要处理?
对于一个高可变基因缩放后的表达矩阵(基因数为2000),在两个样本,分别按列计算每个基因在所有细胞的方差,去除方差为0的基因,去除后剩下1866个。这些基因的scale.data数值也为零值。这些在某个样本均为零值的基因会影响下游找锚点的步骤。
features.var <- RowVar(x = data.use[features, ])
no.var.features <- features[features.var == 0]
features <- setdiff(x = features, y = no.var.features)
features <- features[!is.na(x = features)]
/* Calculates the variance of rows of a matrix */
// [[Rcpp::export(rng = false)]]
NumericVector RowVar(Eigen::Map<Eigen::MatrixXd> x){
NumericVector out(x.rows());
for(int i=0; i < x.rows(); ++i){
Eigen::ArrayXd r = x.row(i).array();
double rowMean = r.mean();
out[i] = (r - rowMean).square().sum() / (x.cols() - 1);
}
return out;
}
进入到RunCCA.default函数,查看该函数,cells1和cells2为每个样本的细胞barcode信息,standardize步骤是针对每个样本高可变基因的缩放数据默认进行standardize标准化。
问题2:standardize标准化具体如何执行的,为何要做此操作?
standardize的源码位置详见链接,通过Rcpp调用C++代码,加速计算运行。
# 传入的参数
object1 = data1
object2 = data2
standardize = TRUE
num.cc = 30
verbose = TRUE
# 主要代码
set.seed(seed = 42)
cells1 <- colnames(x = object1)
cells2 <- colnames(x = object2)
if (standardize) {
object1 <- Standardize(mat = object1, display_progress = FALSE)
object2 <- Standardize(mat = object2, display_progress = FALSE)
}
standardize的源码
/* Performs column scaling and/or centering. Equivalent to using scale(mat, TRUE, apply(x,2,sd)) in R.
Note: Doesn't handle NA/NaNs in the same way the R implementation does, */
// [[Rcpp::export(rng = false)]]
NumericMatrix Standardize(Eigen::Map<Eigen::MatrixXd> mat, bool display_progress = true){
Progress p(mat.cols(), display_progress);
NumericMatrix std_mat(mat.rows(), mat.cols());
for(int i=0; i < mat.cols(); ++i){
p.increment();
Eigen::ArrayXd r = mat.col(i).array();
double colMean = r.mean();
double colSdev = sqrt((r - colMean).square().sum() / (mat.rows() - 1));
NumericMatrix::Column new_col = std_mat(_, i);
for(int j=0; j < new_col.size(); j++) {
new_col[j] = (r[j] - colMean) / colSdev;
}
}
return std_mat;
}
怎么理解这个standardize的函数呢?Standardize函数是等价scale(object, TRUE, apply(object, 2, sd))的,对矩阵的每列(每个细胞)的高可变基因求标准差sd,再scale缩放。
standardize函数的缩放和ScaleData函数有何不同?
ScaleData函数是使每个基因在所有细胞的表达量均值为 0,使每个基因在所有细胞中的表达量方差为 1,针对的是单个基因在所有细胞的缩放;
standardize函数的缩放针对的是单个细胞所有高可变基因集的缩放。表达量方差此处不是1,是 每列求得的sd,apply(object, 2, sd);
testthat::expect_equal(Standardize(object1, display_progress = FALSE), scale(object1, TRUE, apply(object1, 2, sd)),check.attributes = FALSE)
dim(scale(object1, TRUE, apply(object1, 2, sd)))
# [1] 1866 6548
length(apply(object1, 2, sd))
# [1] 6548
data1[1:5, 1:5]
object1[1:5, 1:5]
重新回到RunCCA.default程序上看后面的运行,
1)crossprod是计算两个矩阵object1,object2的内积,生成mat3 ,crossprod(a,b)等价于t(a)%*%b,但计算速度更快;
2)调用irlba包,针对大型密集和稀疏矩阵的快速截断奇异值分解和主成分分析的R包。奇异值分解(SVD)的作用,就是提取一个较复杂矩阵中的关键部分*,然后用一个简单的矩阵代表其关键部分,以达到简化的目的。在遇到维度灾难的时候,数据处理常用的降维方法是SVD(奇异值分解)和PCA(主成分分析)。
3)cca.svd$u是第一个样本的特征矩阵,cca.svd$v是第二个样本的特征矩阵;
U, V, S are the matrices corresponding to the estimated left and right singular vectors, and diagonal matrix of estimated singular values, respectively.
通过前面这么多步骤,终于将矩阵2000*Ncells降到30*Ncells,实现了2000维降至30维。
顺便,我们也理解下CCA(Cannonical Correlation Analysis)的定义,Seurat每次只组合一对数据集(两个样本),为每个多维数据集降维到一个较低维的子空间。
mat3 <- crossprod(x = object1, y = object2)
cca.svd <- irlba(A = mat3, nv = num.cc)
cca.data <- rbind(cca.svd$u, cca.svd$v)
colnames(x = cca.data) <- paste0("CC", 1:num.cc)
rownames(cca.data) <- c(cells1, cells2)
cca.data <- apply(
X = cca.data,
MARGIN = 2,
FUN = function(x) {
if (sign(x[1]) == -1) {
x <- x * -1
}
return(x)
}
)
cca.svd$d
# [1] 651320.30 102499.22 94495.34 84746.89 52127.27 49764.53 41900.86
# [8] 38378.88 35422.20 34252.97 32273.63 25540.07 25339.30 23556.11
# [15] 22299.70 22187.58 20179.09 19572.33 18928.07 17757.20 16859.69
# [22] 16709.28 16388.62 16282.70 15720.46 15625.57 15478.67 15203.39
# [29] 14970.87 14797.46
return(list(ccv = cca.data, d = cca.svd$d))
step7:运行完上面的RunCCA.default函数,回到RunCCA.Seurat函数,将上面得到cca矩阵赋值给cca.results。
通过cca.results的整合结果构造了一个combined.object的seurat对象。默认compute.gene.loadings为true,这个操作会调用ProjectDim函数,大致意思是计算每个基因在CC_1到CC_30的权重,方便查看30个维度主要是哪些基因占主要权重。
cca.results <- list(ccv = cca.data, d = cca.svd$d)
combined.object <- merge(
x = object1,
y = object2,
merge.data = TRUE
)
dim(combined.object)
# [1] 2000 13999
rownames(x = cca.results$ccv) <- Cells(x = combined.object)
colnames(x = data1) <- Cells(x = combined.object)[1:ncol(x = data1)]
colnames(x = data2) <- Cells(x = combined.object)[(ncol(x = data1) + 1):length(x = Cells(x = combined.object))]
combined.object[['cca']] <- CreateDimReducObject(
embeddings = cca.results$ccv[colnames(combined.object), ],
assay = assay1,
key = "CC_"
)
combined.object[['cca']]@assay.used <- DefaultAssay(combined.object)
combined.scale <- cbind(data1,data2)
combined.object <- SetAssayData(object = combined.object,new.data = combined.scale, slot = "scale.data")
if (compute.gene.loadings) {
combined.object <- ProjectDim(
object = combined.object,
reduction = "cca",
verbose = FALSE,
overwrite = TRUE)
}
combined.object
# An object of class Seurat
# 2000 features across 13999 samples within 1 assay
# Active assay: ToIntegrate (2000 features, 0 variable features)
# 1 dimensional reduction calculated: cca
new.feature.loadings.full <- data.use %*% cell.embeddings
head(new.feature.loadings.full)
step8:切回到anchoring.fxn函数,将上面得到combined.object赋值给object.pair,另外,默认情况下,是需要对降维的矩阵进行l2.norm归一化处理的。
object.pair <- list(ccv = cca.data, d = cca.svd$d)
if (l2.norm){
object.pair <- L2Dim(object = object.pair, reduction = reduction)
reduction <- paste0(reduction, ".l2")
nn.reduction <- reduction
}
reduction.2 <- character()
问题3:为什么需要对典型相关向量进行 L2-normalization?
Canonical correlation vectors (CCV) project the two datasets into a correlated low-dimensional space, but global differences in scale (for example, differences in normalization between datasets) can still preclude comparing CCV across datasets. To address this, we perform L2-normalization of the cell embeddings.
We perform canonical correlation analysis, followed by L2 normalization of the canonical correlation vectors, to project the datasets into a subspace defined by shared correlation structure across datasets.
典型相关向量 (CCV) 将两个数据集投影到相关的低维空间中,但全局scale差异(例如,数据集之间的归一化差异)仍然可以进行跨数据集的 CCV 比较。为了解决这个问题,我们对细胞嵌入进行 L2 归一化。我们执行典型相关分析,然后对典型相关向量进行 L2 归一化,以将数据集投影到由跨数据集的共享相关结构定义的子空间中。
当批次效应与细胞状态之间的生物学差异具有相似的模式时,为了克服这个问题,我们首先使用对角化 CCA 联合降低两个数据集的维数,然后将 L2 归一化应用于典型相关向量。
step9:最后调用FindAnchors函数,该函数也较复杂,继续调用其他函数。关于此函数,后续再写。
到这里,整个FindIntegrationAnchors函数运行完毕。
anchors <- FindAnchors(
object.pair = object.pair,
assay = c("ToIntegrate", "ToIntegrate"),
slot = slot,
cells1 = colnames(x = object.1),
cells2 = colnames(x = object.2),
internal.neighbors = internal.neighbors,
reduction = reduction,
reduction.2 = reduction.2,
nn.reduction = nn.reduction,
dims = dims,
k.anchor = k.anchor,
k.filter = k.filter,
k.score = k.score,
max.features = max.features,
nn.method = nn.method,
n.trees = n.trees,
eps = eps,
verbose = verbose
)
anchors[, 1] <- anchors[, 1] + offsets[i]
anchors[, 2] <- anchors[, 2] + offsets[j]
return(anchors)
相关资料:
1.https://github.com/satijalab/seurat
2.https://www.cell.com/cell/pdf/S0092-8674(19)30559-8.pdf