刘小泽写于19.4.5
许多单细胞分析包中都会遇到这个对象,那么说明书当然是最好的学习工具了
说明书地址:https://bioc.ism.ac.jp/packages/3.6/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html
安装
options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/")
library(BiocManager)
BiocInstaller::biocLite("SingleCellExperiment")
BiocInstaller::biocLite("scRNAseq")
两种创建方式
# 第一种
> counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
> sce <- SingleCellExperiment(assays = list(counts = counts))
> sce
class: SingleCellExperiment
dim: 10 10
metadata(0):
assays(1): counts
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):
reducedDimNames(0):
spikeNames(0):
# 第二种
> se <- SummarizedExperiment(list(counts=counts))
> as(se, "SingleCellExperiment")
演示数据集
使用scRNAseq 中的allen
数据集
> data(allen)
> allen
class: SummarizedExperiment
dim: 20908 379
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
现在是一个SummarizedExperiment
,那么使用第二种方法:
> sce <- as(allen, "SingleCellExperiment")
> sce
class: SingleCellExperiment
dim: 20908 379
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(0):
可以对SingleCellExperiment进行的操作
添加spike-in信息
#代码的意思很清楚:先提取 rownames(sce)中的ERCC,结果返回逻辑值,然后传递给isSpike,如果结果是TRUE的部分,就在sce中记做”ERCC“
> isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
> sce
class: SingleCellExperiment
dim: 20908 379
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(1): ERCC
有了这个信息,我们也可以非常方便进行提取,包括数量和名称信息
> table(isSpike(sce, "ERCC")) #提取数量
FALSE TRUE
20816 92
> spikeNames(sce) # 提取名称
[1] "ERCC"
虽然大多数实验中只用到一个spike-in信息,但这个函数还可以继续加spike-in,比如要添加adam
基因家族作为spike-in
> isSpike(sce, "Adam") <- grepl("^Adam[0-9]", rownames(sce)) #添加新spike-in
> sce
class: SingleCellExperiment
dim: 20908 379
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(2): ERCC Adam
> table(isSpike(sce, "Adam"))
FALSE TRUE
20875 33
要取消之前定义的spike-in也很简单
> isSpike(temp,"ERCC") <- NULL
> isSpike(temp,"Adam") <- NULL
> spikeNames(temp)
character(0)
添加Size factors
一般就是计算总reads数作为size factors,更复杂的计算情况可以参考:scran
> sizeFactors(sce) <- colSums(assay(sce))
> head(sizeFactors(sce))
SRR2140028 SRR2140022 SRR2140055 SRR2140083 SRR2139991 SRR2140067
5173863 6445002 2343379 5438526 4757468 2364851
可以根据其他信息计算size factors并存储在一个对象中,然后给他一个名称,比如这里要计算ERCC的size factors
> sizeFactors(sce, "ERCC") <- colSums(assay(sce)[isSpike(sce, "ERCC"),])
# 那么要看ERCC的size factors就要指定ERCC名称。不指定还是默认看之前的
> head(sizeFactors(sce, "ERCC"))
SRR2140028 SRR2140022 SRR2140055 SRR2140083 SRR2139991 SRR2140067
224648 186208 162370 512991 278034 64975
提取colData和rowData
colData
and rowData
分别存储样本和基因信息
# 简单的提取
colData(sce)
rowData(sce)
# 如果想提取出size facotr
colData(sce,internal = T)
# 如果想提取出spike-in
rowData(sce,internal = T)
对研究的基因数量降维(取子集)
一般表达矩阵都有一万多基因,为了分析的简便和速度,可以提取前100基因,但是这也不是随便选取的。它的要求是:log转换后最大方差的前100
> library(magrittr)
> assay(sce) %>% log1p %>% rowVars -> vars
# log1p(x) = log(1+x);rowVars:each row variance in a matrix
> names(vars) <- rownames(sce)
> vars <- sort(vars, decreasing = TRUE) # 为了找到前100的名字
> sce_sub <- sce[names(vars[1:100]),]
> sce_sub
class: SingleCellExperiment
dim: 100 379
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(100): Lamp5 Fam19a1 ... Rnf2 Zfp35
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(2): ERCC Adam
看到reducedDimNames
还是空值,因此下一步就是利用得到的100个方差最大的基因求PCA和tSNE
> library(Rtsne)
> set.seed(5252)
> pca_data <- prcomp(t(log1p(assay(sce_sub))))
> tsne_data <- Rtsne(pca_data$x[,1:50], pca = FALSE) ## 选取前50个主成分;之前做过了PCA,这里就F(默认T)
# t-SNE for visualization, PCA for pseudo-time inference
# reducedDims就像assays一样,可以存储许多不同维度的数据
> reducedDims(sce_sub) <- SimpleList(PCA=pca_data$x, TSNE=tsne_data$Y)
> sce_sub
class: SingleCellExperiment
dim: 100 379
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(100): Lamp5 Fam19a1 ... Rnf2 Zfp35
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(2): PCA TSNE
spikeNames(2): ERCC Adam
reducedDimNames
存储的是坐标,可以按名称或index提取。每一行表示一个基因,每一列表示一个坐标
> reducedDims(sce_sub)
List of length 2
names(2): PCA TSNE
> reducedDimNames(sce_sub)
[1] "PCA" "TSNE"
> head(reducedDim(sce_sub, "PCA")[,1:2])
PC1 PC2
SRR2140028 17.557295 -7.717162
SRR2140022 21.468975 -1.198212
SRR2140055 4.303756 -11.360330
SRR2140083 21.440479 -9.435868
SRR2139991 15.592089 -11.043989
SRR2140067 16.539336 -9.831779
> head(reducedDim(sce_sub, "TSNE")[,1:2])
[,1] [,2]
SRR2140028 -4.547370 -16.76029
SRR2140022 -4.957524 -12.40431
SRR2140055 2.046637 -15.23982
SRR2140083 -3.609021 -14.32577
SRR2139991 -2.158066 -13.48076
SRR2140067 -2.361944 -15.70875
对sce_sub
取子集也就是直接对细胞降维的结果取子集
> dim(reducedDim(sce_sub, "PCA"))
[1] 379 100
> dim(reducedDim(sce_sub[,1:10], "PCA"))
[1] 10 100
快速命名assays并获取
SingleCellExperiment
对象中,可以随意命名assays的entry。但是为了同其他包保持协调,官方给出了建议:
-
counts
: 原始表达量,比如某个基因的的reads数或转录本数 -
normcounts
: 利用原始表达量进行标准化,结果是原始值按一定比例缩小后得到的(中心化)。例如原始表达量除以每个细胞特定的size factor得到的就是中心化数据 -
logcounts
: log转换后的counts值,多数情况下被定义为log-transformednormcounts
, 例如log2(count+1)
-
cpm
: Counts-per-million. 每个细胞中每个基因的read count值,再除以每个细胞的文库大小(单位是M) -
tpm
: Transcripts-per-million. 每个细胞中每个基因的转录本数量,再除以每个细胞的总转录本数(单位是M)
# 这里将counts这个entry,定义成快捷访问tophat_counts
> counts(sce) <- assay(sce, "tophat_counts")
> sce
class: SingleCellExperiment
dim: 20908 379
metadata(2): SuppInfo which_qc
assays(5): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm counts
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(2): ERCC Adam
这样以后直接用counts
提取出tophat_counts
,而不用写后面一大串
欢迎关注我们的公众号~_~
我们是两个农转生信的小硕,打造生信星球,想让它成为一个不拽术语、通俗易懂的生信知识平台。需要帮助或提出意见请后台留言或发送邮件到jieandze1314@gmail.com