To mark the occasion, I'm starting a column: each week I'll curate interesting Q&A from the Seurat community for Seurat users in China.
Why do this?
- Learning by translation
- Learning by cognition
- Learning on the commute
- Tracking the tool's development
- Trust me, you are not alone
The number of single-cell analysts in China keeps growing, and Seurat provides an excellent framework for thinking about single-cell data analysis (not to mention a long list of published papers). When applying this tool, everyone inevitably runs into one problem or another. That can be frustrating, but as a great man once said: a good question is worth a hundred good answers. So I watched every issue on Seurat's GitHub, which is why my personal inbox now looks like this:
So on my commute I read about the pitfalls people have hit with Seurat and how the community responded. It is genuinely interesting, and I get to practice my English along the way.
The plan is to curate no more than 10 issues per week. While translating and reproducing them, I'll also mix in my own notes on learning Seurat, to help Seurat users get to know the project better.
Most of the discussion on GitHub is still in English. Why not discuss in Chinese there? Because that would defeat the purpose of practicing English.
As the classics say, many begin well, but few finish well. How many issues of Seurat Weekly will there be? I can't promise; after all, this is an unfunded endeavor, so let's see how far it goes.
As issue NO.0, today we mainly explore the general rules for topic selection and layout.
The first rule of topic selection is frequency: a topic that many people want to discuss is more likely to be a good question. The second is whatever the editor himself finds interesting.
As for layout, we follow the usual FAQ format: for the question we quote the original description, and the answer is the editor's commentary, mixing personal views with what others have answered.
For each question we also give the GitHub link, so readers can follow up on the original issue.
Let's walk through a few examples:
Getting parameters from pre-existing reduction (umap) (#3053)
GitHub link: https://github.com/satijalab/seurat/issues/3053
What oh111 wants is this: given an existing Seurat object, how can you find out what its previous author did to it? In fact, Seurat records the parameters of every processing step:
Take the dataset bundled with Seurat as an example:
Seurat::pbmc_small@commands
This returns a list. The asker wanted to see the PCA parameters, which is as simple as:
Seurat::pbmc_small@commands$RunPCA.RNA
Command: RunPCA(object = pbmc_small, features = VariableFeatures(object = pbmc_small), verbose = FALSE)
Time: 2018-08-28 04:34:56
assay : RNA
features : PPBP IGLL5 VDAC3 CD1C AKR1C3 PF4 MYL9 GNLY TREML1 CA2 SDPR PGRMC1 S100A8 TUBB1 HLA-DQA1 PARVB RUFY1 HLA-DPB1 RP11-290F20.3 S100A9
compute.dims : 20
rev.pca : FALSE
weight.by.var : TRUE
verbose : FALSE
print.dims : 1 2 3 4 5
features.print : 30
reduction.name : pca
reduction.key : PC
seed.use : 42
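More generally, every processing step is logged in the `commands` slot as a `SeuratCommand` object, so parameters can also be pulled out programmatically. A minimal sketch using the bundled `pbmc_small`:

```r
library(Seurat)

# List every command that has been run on the object
names(pbmc_small@commands)

# Each entry is a SeuratCommand; its arguments live in the params slot
pca.cmd <- pbmc_small@commands$RunPCA.RNA
pca.cmd@params$seed.use
```

This is handy when you receive an object from a collaborator and want to audit how it was produced.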
Reorder cells by expression value in DoHeatmap & Dendogram (#3036)
GitHub link: https://github.com/satijalab/seurat/issues/3036
This is a question about visualization details. Plots produced with default parameters often aren't publication-ready, or we have small tweaks of our own in mind; the most common one is changing the order of the labels. As we know, this is generally controlled by the levels of a factor variable.
One helpful user gave this link: https://stackoverflow.com/questions/52136211/how-to-reorder-cells-in-doheatmap-plot-in-seurat-ggplot2
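Since the grouping in DoHeatmap is driven by a factor, the simplest fix for label order is to relevel the identities before plotting. A sketch, assuming the bundled `pbmc_small` and a purely hypothetical desired order:

```r
library(Seurat)

# Put the cluster labels in a custom order; DoHeatmap lays out
# cells (and group labels) following the factor's level order.
new.order <- c("2", "0", "1")  # hypothetical order for pbmc_small's idents
Idents(pbmc_small) <- factor(Idents(pbmc_small), levels = new.order)
DoHeatmap(pbmc_small)
```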
So let's see how Seurat implements its plotting by reading the source of DoHeatmap:
DoHeatmap
function (object, features = NULL, cells = NULL, group.by = "ident",
    group.bar = TRUE, group.colors = NULL, disp.min = -2.5, disp.max = NULL,
    slot = "scale.data", assay = NULL, label = TRUE, size = 5.5,
    hjust = 0, angle = 45, raster = TRUE, draw.lines = TRUE,
    lines.width = NULL, group.bar.height = 0.02, combine = TRUE)
{
    cells <- cells %||% colnames(x = object)
    if (is.numeric(x = cells)) {
        cells <- colnames(x = object)[cells]
    }
    assay <- assay %||% DefaultAssay(object = object)
    DefaultAssay(object = object) <- assay
    features <- features %||% VariableFeatures(object = object)
    features <- rev(x = unique(x = features))
    disp.max <- disp.max %||% ifelse(test = slot == "scale.data",
        yes = 2.5, no = 6)
    possible.features <- rownames(x = GetAssayData(object = object,
        slot = slot))
    if (any(!features %in% possible.features)) {
        bad.features <- features[!features %in% possible.features]
        features <- features[features %in% possible.features]
        if (length(x = features) == 0) {
            stop("No requested features found in the ", slot,
                " slot for the ", assay, " assay.")
        }
        warning("The following features were omitted as they were not found in the ",
            slot, " slot for the ", assay, " assay: ", paste(bad.features,
                collapse = ", "))
    }
    data <- as.data.frame(x = as.matrix(x = t(x = GetAssayData(object = object,
        slot = slot)[features, cells, drop = FALSE])))
    object <- suppressMessages(expr = StashIdent(object = object,
        save.name = "ident"))
    group.by <- group.by %||% "ident"
    groups.use <- object[[group.by]][cells, , drop = FALSE]
    plots <- vector(mode = "list", length = ncol(x = groups.use))
    for (i in 1:ncol(x = groups.use)) {
        data.group <- data
        group.use <- groups.use[, i, drop = TRUE]
        if (!is.factor(x = group.use)) {
            group.use <- factor(x = group.use)
        }
        names(x = group.use) <- cells
        if (draw.lines) {
            lines.width <- lines.width %||% ceiling(x = nrow(x = data.group) *
                0.0025)
            placeholder.cells <- sapply(X = 1:(length(x = levels(x = group.use)) *
                lines.width), FUN = function(x) {
                return(RandomName(length = 20))
            })
            placeholder.groups <- rep(x = levels(x = group.use),
                times = lines.width)
            group.levels <- levels(x = group.use)
            names(x = placeholder.groups) <- placeholder.cells
            group.use <- as.vector(x = group.use)
            names(x = group.use) <- cells
            group.use <- factor(x = c(group.use, placeholder.groups),
                levels = group.levels)
            na.data.group <- matrix(data = NA, nrow = length(x = placeholder.cells),
                ncol = ncol(x = data.group), dimnames = list(placeholder.cells,
                    colnames(x = data.group)))
            data.group <- rbind(data.group, na.data.group)
        }
        lgroup <- length(levels(group.use))
        plot <- SingleRasterMap(data = data.group, raster = raster,
            disp.min = disp.min, disp.max = disp.max, feature.order = features,
            cell.order = names(x = sort(x = group.use)), group.by = group.use)
        if (group.bar) {
            default.colors <- c(hue_pal()(length(x = levels(x = group.use))))
            cols <- group.colors[1:length(x = levels(x = group.use))] %||%
                default.colors
            if (any(is.na(x = cols))) {
                cols[is.na(x = cols)] <- default.colors[is.na(x = cols)]
                cols <- Col2Hex(cols)
                col.dups <- sort(x = unique(x = which(x = duplicated(x = substr(x = cols,
                    start = 1, stop = 7)))))
                through <- length(x = default.colors)
                while (length(x = col.dups) > 0) {
                    pal.max <- length(x = col.dups) + through
                    cols.extra <- hue_pal()(pal.max)[(through +
                        1):pal.max]
                    cols[col.dups] <- cols.extra
                    col.dups <- sort(x = unique(x = which(x = duplicated(x = substr(x = cols,
                        start = 1, stop = 7)))))
                }
            }
            group.use2 <- sort(x = group.use)
            if (draw.lines) {
                na.group <- RandomName(length = 20)
                levels(x = group.use2) <- c(levels(x = group.use2),
                    na.group)
                group.use2[placeholder.cells] <- na.group
                cols <- c(cols, "#FFFFFF")
            }
            pbuild <- ggplot_build(plot = plot)
            names(x = cols) <- levels(x = group.use2)
            y.range <- diff(x = pbuild$layout$panel_params[[1]]$y.range)
            y.pos <- max(pbuild$layout$panel_params[[1]]$y.range) +
                y.range * 0.015
            y.max <- y.pos + group.bar.height * y.range
            plot <- plot + annotation_raster(raster = t(x = cols[group.use2]),
                xmin = -Inf, xmax = Inf, ymin = y.pos, ymax = y.max) +
                coord_cartesian(ylim = c(0, y.max), clip = "off") +
                scale_color_manual(values = cols)
            if (label) {
                x.max <- max(pbuild$layout$panel_params[[1]]$x.range)
                x.divs <- pbuild$layout$panel_params[[1]]$x.major
                x <- data.frame(group = sort(x = group.use),
                    x = x.divs)
                label.x.pos <- tapply(X = x$x, INDEX = x$group,
                    FUN = median) * x.max
                label.x.pos <- data.frame(group = names(x = label.x.pos),
                    label.x.pos)
                plot <- plot + geom_text(stat = "identity", data = label.x.pos,
                    aes_string(label = "group", x = "label.x.pos"),
                    y = y.max + y.max * 0.03 * 0.5, angle = angle,
                    hjust = hjust, size = size)
                plot <- suppressMessages(plot + coord_cartesian(ylim = c(0,
                    y.max + y.max * 0.002 * max(nchar(x = levels(x = group.use))) *
                        size), clip = "off"))
            }
        }
        plot <- plot + theme(line = element_blank())
        plots[[i]] <- plot
    }
    if (combine) {
        plots <- CombinePlots(plots = plots)
    }
    return(plots)
}
<bytecode: 0x11837bb8>
<environment: namespace:Seurat>
We can see that Seurat builds its plots with ggplot2, so we can use our own ggplot2 skills to polish the returned object:
require(Seurat)
p <- DoHeatmap(pbmc_small)
p$theme
p$layers
p$mapping
head(p$data)
Feature Cell Expression Identity
1 S100A9 ATGCCAGAACGACT -0.7639656 0
2 RP11-290F20.3 ATGCCAGAACGACT -0.3730316 0
3 HLA-DPB1 ATGCCAGAACGACT -1.0399941 0
4 RUFY1 ATGCCAGAACGACT -0.4098329 0
5 PARVB ATGCCAGAACGACT -0.3461794 0
6 HLA-DQA1 ATGCCAGAACGACT -0.6717070 0
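Because `p` is an ordinary ggplot object, any ggplot2 tweak applies after the fact; for example (the color choice here is my own, not the package's recommendation), swapping the expression color scale:

```r
library(Seurat)
library(ggplot2)

p <- DoHeatmap(pbmc_small)

# Replace the fill scale of the heatmap body; ggplot2 will note
# that an existing scale is being replaced, which is expected.
p + scale_fill_gradientn(colours = c("navy", "white", "firebrick"))
```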
Time series dataset in two conditions (#3040)
GitHub link: https://github.com/satijalab/seurat/issues/3040
This is a common question; multi-sample analysis is worth discussing in many respects. The answerer pointed straight to several vignettes and two references:
I would recommend that you review the Guided tutorial, Multiple dataset integration (specifically the SCT method), and the Stimulated vs Control PBMCs, in that order. Due to the large probable size of your final dataset, I think that the manipulations found within the Mouse Cell Atlas Vignette might also prove useful. The Visualization and Cell cycle regression vignettes were also particularly helpful for our analysis of similar conditions. Lastly, I recommend that you review the methods in Farnsworth et al. 2020 as well as Soldatov et al., 2019 for further help. Hope that this was useful.
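For reference, the SCT-based integration workflow the answer points to boils down to per-sample normalization followed by anchoring. A rough sketch, where `ctrl` and `stim` are hypothetical per-condition Seurat objects:

```r
library(Seurat)

obj.list <- list(ctrl, stim)               # hypothetical per-condition objects
obj.list <- lapply(obj.list, SCTransform)  # normalize each sample separately
features <- SelectIntegrationFeatures(object.list = obj.list, nfeatures = 3000)
obj.list <- PrepSCTIntegration(object.list = obj.list, anchor.features = features)
anchors  <- FindIntegrationAnchors(object.list = obj.list,
                                   normalization.method = "SCT",
                                   anchor.features = features)
combined <- IntegrateData(anchorset = anchors, normalization.method = "SCT")
```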
Optimal strategy to process samples to compare two different condition. (#3019)
This is quite similar to the previous question, both about handling multiple samples. Here someone offered:
I wouldn't recommend using cellranger aggr as it will downsample everything to a similar number of counts as the sample with the lowest counts, and so potentially will throw out a lot of data. You could quantify each dataset (using cellranger, or another tool like Alevin) and first merge them in Seurat and check if there are batch differences between the datasets. If there are, you could run the integration methods.
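The merge-first-then-check advice can be sketched as follows, with `s1` and `s2` standing in for two separately quantified samples:

```r
library(Seurat)

# s1 and s2 are hypothetical objects quantified per sample
merged <- merge(s1, y = s2, add.cell.ids = c("cond1", "cond2"))
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged)
merged <- ScaleData(merged)
merged <- RunPCA(merged)
merged <- RunUMAP(merged, dims = 1:30)

# Color the embedding by sample of origin to eyeball batch effects;
# if the samples separate strongly, consider the integration workflow.
DimPlot(merged, group.by = "orig.ident")
```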
Deploying Shiny apps using Seurat library to shinyapps.io (#2716)
GitHub link: https://github.com/satijalab/seurat/issues/2716
This kind of question is about extending Seurat, here by embedding it in a Shiny app, i.e. putting a graphical interface on it. That requires the developer to understand not only Seurat's functions, its data objects, and the R packages and environment they depend on, but also Shiny's syntax and structure. Most of the discussion under this issue is about bridging the gap between Seurat and Shiny.
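As a taste of what such a bridge looks like, here is a minimal sketch (the gene selector and the use of `pbmc_small` are purely illustrative, not taken from the issue):

```r
library(shiny)
library(Seurat)

ui <- fluidPage(
  selectInput("gene", "Gene", choices = rownames(pbmc_small)),
  plotOutput("feat")
)

server <- function(input, output, session) {
  output$feat <- renderPlot({
    # Redraw the feature plot whenever a new gene is picked
    FeaturePlot(pbmc_small, features = input$gene)
  })
}

shinyApp(ui, server)
```

Deploying to shinyapps.io then mostly becomes a matter of keeping the object small enough and the package versions pinned.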
That's it for today; thanks for staying with me. By the way, have you asked a question on GitHub yet?