单细胞转录组分析---seurat 1例子

==================安装seurant================

install.packages('Seurat')

=====测试例子：pbmc3k_filtered_gene_bc_matrices.tar.gz==

library(Seurat)

library(dplyr)

library(Matrix)

pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

dense.size <- object.size(x = as.matrix(x = pbmc.data))

sparse.size <- object.size(x = pbmc.data)

dense.size/sparse.size

# Initialize the Seurat objectwiththeraw(non-normalized data).

Keep all# genes expressed in>=3 cells(~0.1%of the data).Keep all cells with at# least 200 detected genes

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

pbmc

#An object of class Seurat

#13714 features across 2700 samples within 1 assay

#Active assay: RNA (13714 features, 0 variable features)

=========QC======================

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern ="^MT-")

# Visualize QC metrics as a violin plot

VlnPlot(pbmc, features = c("nFeature_RNA","nCount_RNA","percent.mt"), ncol =3)

#nFeature_RNA：代表的是在该细胞中共检测到的表达量大于0的基因个数，nCount_RNA：代表的是该细胞中所有基因的表达量之和，percent.mt：代表的是线粒体基因表达量的百分比，通过小提琴图来展示

图中每个点代表的是一个细胞，反应的是对应特征在所有细胞的一个分布情况。通过观察上图，我们可以确定一个大概的筛选范围。以nFeature_RNA为例，可以看到数值在2000以上的细胞是非常少的，可以看做是离群值，所以在筛选时，如果一个细胞中检测到的基因个数大于2000，就可以进行过滤。

在过滤阈值时，我们还需要考虑一个因素，就是这3个指标之间的相互关系，可以通过以下方式得到：

plot1 <- FeatureScatter(pbmc, feature1 ="nCount_RNA", feature2 ="percent.mt")

plot2 <- FeatureScatter(pbmc, feature1 ="nCount_RNA", feature2 ="nFeature_RNA")

plot1 + plot2

以nCount和gene之间的关系图为例，可以看到非常明显的一个相关性，当gene个数为2000时对应的umi在10000左右，所以在设定阈值，我们想过滤掉gene大于2000的细胞，此时umi的阈值就应该设置在10000左右。

结合以上两种图表，判断出一个合适的阈值之后，就可以进行过滤了，代码如下: (其实我没看懂这参数是怎么选择的，总觉得2000就可以了)

pbmc <- subset(pbmc, subset = nFeature_RNA >200& nFeature_RNA <2500& percent.mt <5)

===============normalize=============

预处理之后，首先进行归一化：

pbmc <- NormalizeData(pbmc, normalization.method ="LogNormalize", scale.factor =10000)

默认采用LogNormalize归一化算法，首先将原始的表达量转换成相对丰度，然后乘以scale.factor的值，在取对数。

========feature selection=============

归一化之后，Seurat提取那些在细胞间变异系数较大的基因用于下游分析:

pbmc <- FindVariableFeatures(pbmc, selection.method ="vst", nfeatures =2000)

默认的将返回2000个高变异的基因。

top10 <- head(VariableFeatures(pbmc), 10) //返回最高变异的10个基因

plot1 <- VariableFeaturePlot(pbmc)

plot1

从图中可以看出挑选的2000个高变异的基因明显分布和误差比低变异度的高。

plot2 <- LabelPoints(plot = plot1, points = top10)

plot2 //加上前10个高变异基因的label更加清晰

============scale=================

scale的目的，官方是如此描述的：

Shifts the expression of each gene, so that the mean expression across cells is 0 //保持同一个基因在不同的cell之间均值为0

Scales the expression of each gene, so that the variance across cells is 1 //保持同一个基因在不同cell之间的方差为1

This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate //这样避免高表达基因过度占据

注：我还在想normalization和scale的区别

all.genes <- rownames(pbmc)

pbmc <- ScaleData(pbmc, features = all.genes)

=============PCA================

聚类分析用于识别细胞亚型，在Seurat中，不是直接对所有细胞进行聚类分析，而是首先进行PCA主成分分析，然后挑选贡献量最大的几个主成分，用挑选出的主成分的值来进行聚类分析

对PCA分析结果可以进行一系列的可视化： print, VizDimLoadings, DimPlot, and DimHeatmap

print(pbmc[["pca"]], dims =1:5, nfeatures =5)

## PC_ 1

## Positive: CST3, TYROBP, LST1, AIF1, FTL

## Negative: MALAT1, LTB, IL32, IL7R, CD2

## PC_ 2

## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1

## Negative: NKG7, PRF1, CST7, GZMB, GZMA

## PC_ 3

## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1

## Negative: PPBP, PF4, SDPR, SPARC, GNG11

## PC_ 4

## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1

## Negative: VIM, IL7R, S100A6, IL32, S100A8

## PC_ 5

## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY

## Negative: LTB, IL7R, CKB, VIM, MS4A7

VizDimLoadings(pbmc, dims =1:2, reduction ="pca")

DimPlot(pbmc, reduction ="pca")

DimHeatmap(pbmc, dims =1, cells =500, balanced =TRUE)

DimHeatmap(pbmc, dims =1:15, cells =500, balanced =TRUE)

=======找到有统计学显著性的主成分========

主成分分析结束后需要确定哪些主成分所代表的基因可以进入下游分析，这里可以使用JackStraw做重抽样分析。用JackStrawPlot可视化看看哪些主成分可以进行下游分析

pbmc <- JackStraw(pbmc, num.replicate =100)

pbmc <- ScoreJackStraw(pbmc, dims =1:20)

JackStrawPlot(pbmc, dims =1:15)

In this example, we can observe an 'elbow' around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.

当然，也可以用最经典的碎石图来确定主成分。

ElbowPlot(pbmc)

//在这个例子中，我们可以观察到在PC9-10附近趋于平缓，这表明大部分真实信号是在前10个PCs中捕获的

这个确定主成分是非常有挑战性的: - The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. - The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. - The third is a heuristic that is commonly used, and can be calculated instantly.

在本例子里面，几种方法结果差异不大，可以在PC7~10直接挑选。

======cluster the cell==============

pbmc <- FindNeighbors(pbmc, dims =1:10)

pbmc <- FindClusters(pbmc, resolution =0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

## Number of nodes: 2638

## Number of edges: 96033

## Running Louvain algorithm...

## Maximum modularity in 10 random starts: 0.8720

## Number of communities: 9

## Elapsed time: 0 seconds

head(Idents(pbmc),5)

pbmc <- RunUMAP(pbmc, dims =1:10)

DimPlot(pbmc, reduction ="umap")

====鉴定cluster之间差异基因==========

cluster1.markers <- FindMarkers(pbmc, ident.1 =1, min.pct =0.25) //鉴定cluster 1中的marker gene

head(cluster1.markers, n =5)

cluster5.markers <- FindMarkers(pbmc, ident.1 =5, ident.2 = c(0,3), min.pct =0.25)

//鉴定cluster 5区别于cluster 0和3的marker gene

head(cluster5.markers, n =5)

pbmc.markers <- FindAllMarkers(pbmc, only.pos =TRUE, min.pct =0.25, logfc.threshold =0.25)

//find markers for every cluster compared to all remaining cells,report only the positive ones

pbmc.markers %>% group_by(cluster) %>% top_n(n =2, wt = avg_logFC)

同时，该包提供了一系列可视化方法来检查差异分析的结果的可靠性：

VlnPlot (shows expression probability distributions across clusters)

FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are our most commonly used visualizations

RidgePlot, CellScatter, and DotPlot

VlnPlot(pbmc, features = c("MS4A1","CD79A"))

VlnPlot(pbmc, features = c("NKG7","PF4"), slot ="counts", log =TRUE)

FeaturePlot(pbmc, features =c("MS4A1","GNLY","CD3E","CD14","FCER1A","FCGR3A","LYZ","PPBP","CD8A"))

top20 <- pbmc.markers %>% group_by(cluster) %>% top_n(n =20, wt = avg_logFC)

DoHeatmap(pbmc, features = top20$gene) + NoLegend()

======给每个cluster注释=========

Assigning cell type identity to clusters

这个主要取决于生物学背景知识：

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")

names(new.cluster.ids) <- levels(pbmc)

pbmc <- RenameIdents(pbmc, new.cluster.ids)

DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

saveRDS(pbmc, file = "pbmc3k_final.rds")