参考:https://satijalab.org/seurat/archive/v3.1/pbmc3k_tutorial.html
https://www.jianshu.com/p/26d3648b2e1f
公众号:bioinfomics
构建Seurat对象
本教程使用的是来自10X Genomics平台测序的外周血单核细胞(PBMC)数据集,这个数据集是用Illumina NextSeq 500平台进行测序的,里面包含了2,700个细胞的RNA-seq数据。
这个原始数据是用CellRanger软件进行基因表达定量处理的,最终生成了一个UMI的count矩阵。矩阵里的列是不同barcode指示的细胞,行是测序到的不同基因。接下来,我们将使用这个count矩阵创建Seurat对象
setwd('D:\\genetic_r\\singer-cell-learning\\study-Seurat')
library(Seurat)
library(dplyr)
# 导入PBMC数据,用Read10×这个参数
pbmc.data <- Read10X(data.dir = "D:/genetic_r/singer-cell-learning/study-Seurat/pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19")
## 起始的Seurat对象是原始数据,没有经过标准化的
#初步过滤:每个细胞中至少检测到200个基因,每个基因至少在3个细胞中表达
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
#An object of class Seurat
#13714 features across 2700 samples within 1 assay
#Active assay: RNA (13714 features, 0 variable features)
查看具体的某个基因在这个矩阵里每个细胞的表达情况
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
3 x 30 sparse Matrix of class "dgCMatrix"
[[ suppressing 30 column names ‘AAACATACAACCAC-1’, ‘AAACATTGAGCTAC-1’, ‘AAACATTGATCAGC-1’ ... ]]
CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
表达矩阵中的"."表示某一基因在某一细胞中没有检测到表达,因为scRNA-seq的表达矩阵中会存在很多表达量为0的数据,Seurat会使用稀疏矩阵进行数据的存储,可以有效减少存储的内存。
标准的数据预处理
质量控制,选择需要分析的细胞
在Seurat中可以使用PercentageFeatureSet函数计算每个细胞中线粒体的含量:在人类参考基因中线粒体基因是以“MT-”开头的,而在小鼠中是以“mt-”开头的。
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data, 5)
orig.ident nCount_RNA nFeature_RNA percent.mt
AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
可视化QC指标,并用它们来过滤细胞:
1)将unique基因count数超过2500,或者小于200的细胞过滤掉
2)把线粒体含量超过5%以上的细胞过滤掉
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
nFeature_RNA代表每个细胞测到的基因数目
nCount_RNA代表每个细胞测到所有基因的表达量之和。
percent.mt代表测到的线粒体基因的比例
我们还可以使用FeatureScatter函数来对不同特征-特征之间的关系进行可视化
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
根据QC指标进行细胞和基因的过滤
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
数据的归一化
默认情况下,Seurat使用global-scaling的归一化方法,称为“LogNormalize”,这种方法是利用总的表达量对每个细胞里的基因表达值进行归一化,乘以一个scale factor(默认值是10000),再用log转换一下。归一化后的数据存放在pbmc[["RNA"]]@data里。
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
鉴定高可变基因
Seurat使用FindVariableFeatures函数鉴定高可变基因,这些基因在PBMC不同细胞之间的表达量差异很大(在一些细胞中高表达,在另一些细胞中低表达)。默认情况下,会返回2,000个高可变基因用于下游的分析,如PCA等。
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10)
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
注意这里绘图需要将Rstudio的绘图区拉到最高(尽量高,否则会报错)
数据的标准化
Seurat使用ScaleData函数对归一化后的count矩阵进行一个线性的变换(“scaling”),将数据进行标准化:
1)shifting每个基因的表达,使细胞间的平均表达为0
2)scaling每个基因的表达,使细胞间的差异为1
ScaleData默认对之前鉴定到的2000个高可变基因进行标准化,也可以通过vars.to.regress参数指定其他的变量对数据进行标准化,表达矩阵进行scaling后,其结果存储在pbmc[["RNA"]]@scale.data中。
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
PCA线性降维
Seurat使用RunPCA函数对标准化后的表达矩阵进行PCA降维处理,默认情况下,只对前面选出的2000个高可变基因进行线性降维,也可以通过feature参数指定想要降维的数据集。
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
PC_ 1
Positive: CST3, TYROBP, LST1, AIF1, FTL
Negative: MALAT1, LTB, IL32, IL7R, CD2
PC_ 2
Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
Negative: NKG7, PRF1, CST7, GZMB, GZMA
PC_ 3
Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
Negative: PPBP, PF4, SDPR, SPARC, GNG11
PC_ 4
Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
Negative: VIM, IL7R, S100A6, IL32, S100A8
PC_ 5
Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
Negative: LTB, IL7R, CKB, VIM, MS4A7
Seurat可以使用VizDimReduction, DimPlot, 和DimHeatmap函数对PCA的结果进行可视化
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:6, cells = 500, balanced = TRUE)
选择PCA降维的维数用于后续的分析
Seurat可以使用两种方法确定PCA降维的维数用于后续的聚类分析:
- 使用JackStrawPlot函数
使用JackStraw函数计算每个PC的P值的分布,显著的PC会有较低的p-value
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# 使用JackStrawPlot函数进行可视化
JackStrawPlot(pbmc, dims = 1:15)
- 使用ElbowPlot函数
使用ElbowPlot函数查看在哪一个PC处出现平滑的挂点:
ElbowPlot(pbmc)
这里我们选择了10,但是建议在用户分析自己的数据时,选择不同的pc数来重复下有分析(10,15,甚至50),也不要选择过少的pc数,会影响下游的分析效果。
细胞的聚类分群
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 2638
Number of edges: 95965
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.8723
Number of communities: 9
Elapsed time: 0 seconds
head(Idents(pbmc), 5)
AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
2 3 2 1
AAACCGTGTATGCG-1
6
Levels: 0 1 2 3 4 5 6 7 8
非线性降维可视化(UMAP/tSNE)
Seurat提供了几种非线性的降维技术,如tSNE和UMAP。这些算法的目标是在低维空间中将相似的细胞聚在一起。作为tSNE和UMAP的input,建议使用与聚类分析的输入相同的PCs作为输入。
# UMAP降维可视化
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label individual clusters
DimPlot(pbmc, reduction = "umap")
#tSNE降维可视化
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne", label = TRUE)
寻找差异表达基因
Seurat使用FindMarkers和FindAllMarkers函数进行差异表达基因的筛选,可以通过test.use参数指定不同的差异表达基因筛选的方法。
#在cluster1里寻找marker
#min.pct参数定义了一个基因在两组细胞任意一组里的最低可检测到的百分率
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
colnames(cluster1.markers)[2] <- 'avg_logFC'
head(cluster1.markers, n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adj
S100A9 0.000000e+00 5.570063 0.996 0.215 0.000000e+00
S100A8 0.000000e+00 5.477394 0.975 0.121 0.000000e+00
FCN1 0.000000e+00 3.394219 0.952 0.151 0.000000e+00
LGALS2 0.000000e+00 3.800484 0.908 0.059 0.000000e+00
CD14 2.856582e-294 2.815626 0.667 0.028 3.917516e-290
#寻找cluster5与cluster0和3之间的差异maker
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
colnames(cluster1.markers)[2] <- 'avg_logFC'
head(cluster5.markers, n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adj
FCGR3A 2.150929e-209 4.267579 0.975 0.039 2.949784e-205
IFITM3 6.103366e-199 3.877105 0.975 0.048 8.370156e-195
CFD 8.891428e-198 3.411039 0.938 0.037 1.219370e-193
CD68 2.374425e-194 3.014535 0.926 0.035 3.256286e-190
RP11-290F20.3 9.308287e-191 2.722684 0.840 0.016 1.276538e-186
寻找每一个cluster里与其它所有细胞相比制厚朴的差异marker
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
# A tibble: 18 x 7
# Groups: cluster [9]
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 1.74e-109 1.07 0.897 0.593 2.39e-105 0 LDHB
2 1.17e- 83 1.33 0.435 0.108 1.60e- 79 0 CCR7
3 0. 5.57 0.996 0.215 0. 1 S100A9
4 0. 5.48 0.975 0.121 0. 1 S100A8
5 7.99e- 87 1.28 0.981 0.644 1.10e- 82 2 LTB
6 2.61e- 59 1.24 0.424 0.111 3.58e- 55 2 AQP3
7 0. 4.31 0.936 0.041 0. 3 CD79A
8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
9 1.17e-178 2.97 0.957 0.241 1.60e-174 4 CCL5
10 4.93e-169 3.01 0.595 0.056 6.76e-165 4 GZMK
11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
13 1.05e-265 4.89 0.986 0.071 1.44e-261 6 GZMB
14 6.82e-175 4.92 0.958 0.135 9.36e-171 6 GNLY
15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
17 7.73e-200 7.24 1 0.01 1.06e-195 8 PF4
18 3.68e-110 8.58 1 0.024 5.05e-106 8 PPBP
Marker基因的可视化
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
RidgePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
DotPlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
#选取每一个cluster里的top20的marker画热图
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
对聚类后的不同类群进行注释
我们可以基于已有的生物学知识,根据一些特异的marker基因不细胞类群进行注释,以下是根据PBMC中不同细胞类群的一些marker基因进行细胞分群注释的结果new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
# 细胞分群的重命名
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
# 保存分析的结果
saveRDS(pbmc, file = "pbmc3k_final.rds")
library(Seurat)
rm(list=ls())
library(dplyr)
library(dplyr)
library(patchwork)
packageVersion("Seurat")
# 导入PBMC数据,用Read10×这个参数
pbmc.data <- Read10X(data.dir = "D:/genetic_r/singer-cell-learning/study-Seurat/pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19")
## 起始的Seurat对象是原始数据,没有经过标准化的
#初步过滤:每个细胞中至少检测到200个基因,每个基因至少在3个细胞中表达
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
#具体的某个基因在这个矩阵里每个细胞的表达情况
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:2]
#表达矩阵中的"."表示某一基因在某一细胞中没有检测到表达,因为scRNA-seq的表达矩阵中会存在很多表达量为0的数据,Seurat会使用稀疏矩阵进行数据的存储,可以有效减少存储的内存。
#过滤线粒体基因
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data, 5)
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
#根据QC指标进行细胞和基因的过滤
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
#归一化
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
鉴定高可变基因
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10)
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
数据的标准化
pbmc <- ScaleData(pbmc)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
进行PCA线性降维
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
可视化
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
选择PCA降维的维数用于后续的分析
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
# 使用JackStrawPlot函数进行可视化
JackStrawPlot(pbmc, dims = 1:15)
ElbowPlot(pbmc)
#细胞聚类
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
head(Idents(pbmc), 5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
# UMAP降维可视化
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label individual clusters
DimPlot(pbmc, reduction = "umap")
#tSNE降维可视化
pbmc <- RunTSNE(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "tsne", label = TRUE)
寻找差异表达基因
#在cluster1里寻找marker
#min.pct参数定义了一个基因在两组细胞任意一组里的最低可检测到的百分率
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
colnames(cluster1.markers)[2] <- 'avg_logFC'
head(cluster1.markers, n = 5)
#寻找cluster5与cluster0和3之间的差异maker
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
#寻找每一个cluster里与其它所有细胞相比制厚朴的差异marker
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
####Marker基因的可视化
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
RidgePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
DotPlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
#选取每一个cluster里的top20的marker画热图
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
# 细胞分群的重命名
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
# 保存分析的结果
saveRDS(pbmc, file = "pbmc3k_final.rds")