R在读取和处理数据的过程中会将所有的变量和占用都储存在RAM当中,这样一来,对于海量的单细胞RNA-seq数据(尤其是超过250k的细胞量),即使在服务器当中运行,Seurat、metacell、monocle这一类的R包的使用还是会产生内存不足的问题。
但是最近我发现了一个基于python的单细胞基因表达分析包scanpy,能够很好地在我这个仅4G内存的小破机上分析378k的细胞,并且功能丰富程度不亚于Seurat。它包含了数据预处理、可视化、聚类、伪时间分析和轨迹推断、差异表达分析。根据官网描述,scanpy能够有效处理超过1,000,000个细胞的数据集。
Scanpy – Single-Cell Analysis in Python:https://scanpy.readthedocs.io/en/latest/
安装与数据下载
安装好python3之后,终端运行:
pip install scanpy
若安装过程出现问题,请参考:
https://scanpy.readthedocs.io/en/latest/installation.html
骨髓单细胞转录组测序数据下载地址:
https://preview.data.humancellatlas.org
Step0, 读取数据
运行python
import numpy as np
import pandas as pd
import scanpy as sc
# 可以直接读取10Xgenomics的.h5格式数据
adata = sc.read_10x_h5("/Users/shinianyike/Desktop/ica_bone_marrow_h5.h5"
, genome=None, gex_only=True)
adata.var_names_make_unique()
查看数据:
adata
AnnData object with n_obs × n_vars = 378000 × 33694
var: 'gene_ids'
共378000个细胞,33694个基因。
Step1, 数据预处理
这一步目的将数据进行筛选和过滤,一些测序质量差的数据会让后续的分析产生噪音和干扰,因此我们需要将它们去除。
展示在所有的细胞当中表达占比最高的20个基因。
sc.pl.highest_expr_genes(adata, n_top=20)
基础过滤:
去除表达基因200以下的细胞;
去除在3个细胞以下表达的基因。
sc.pp.filter_cells(adata, min_genes=200) # 去除表达基因200以下的细胞
sc.pp.filter_genes(adata, min_cells=3) # 去除在3个细胞以下表达的基因
adata
AnnData object with n_obs × n_vars = 335618 × 24888
obs: 'n_genes'
var: 'gene_ids', 'n_cells'
共335618个细胞,24888个基因。(可以发现细胞跟基因数量都减少了)
质量控制:
在测序后,有很大比例是低质量的细胞,可能是因为细胞破碎造成的胞质RNA流失,由于线粒体比单个的转录分子要大得多,不容易在破碎的细胞膜中泄漏出去,这样一来就导致测序结果显示线粒体基因的比例在细胞内占比异常高。这一步质量控制的目的就是将这些低质量的细胞去除掉。
计算线粒体基因占所有基因的比例:
mito_genes = adata.var_names.str.startswith('MT-')
# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1
# add the total counts per cell as observations-annotation to adata
adata.obs['n_counts'] = adata.X.sum(axis=1).A1
作小提琴图,查看线粒体基因占比分布:
sc.pl.violin(adata, ['n_genes', 'n_counts', 'percent_mito'],
jitter=0.4, multi_panel=True)
由于数据点实在太多,小提琴已被数据点覆盖,无法显示出来。
这里还可以做一个散点图,也可以直观地显示出一些异常分布的数据点。
sc.pl.scatter(adata, x='n_counts', y='percent_mito')
sc.pl.scatter(adata, x='n_counts', y='n_genes')
adata
AnnData object with n_obs × n_vars = 335618 × 24888
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells'
335618个细胞,24888个基因;
下面这一步进行真正的过滤,根据上面做的图,去除基因数目过多(大于等于4000)和线粒体基因占比过多(大于等于0.3)的细胞:
adata = adata[adata.obs['n_genes'] < 4000, :]
adata = adata[adata.obs['percent_mito'] < 0.3, :]
过滤后查看剩下多少细胞和基因。
adata
View of AnnData object with n_obs × n_vars = 328435 × 24888
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells'
328435个细胞,24888个基因。
数据标准化
在绘图之前,还要进行一步数据标准化,将表达量用对数计算一遍,这样有利于绘图和展示。
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
sc.pp.log1p(adata)
adata.raw = adata # 储存标准化后的AnnaData Object
识别差异表达基因
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
将保守的基因去除,留下差异表达的基因用于后续分析。
adata = adata[:, adata.var['highly_variable']]
adata
View of AnnData object with n_obs × n_vars = 328435 × 1372
obs: 'n_genes', 'percent_mito', 'n_counts'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
328435个细胞,1372个基因。
回归每个细胞总计数和线粒体基因表达百分比的影响。将数据放缩到方差为1。单细胞数据集可能包含“不感兴趣”的变异来源。这不仅包括技术噪音,还包括批次效应,甚至包括生物变异来源(细胞周期阶段)。正如(Buettner, et al NBT,2015)中所建议的那样,从分析中回归这些信号可以改善下游维数减少和聚类。
这一步高内存需求预警,建议清理电脑缓存,关闭后台不使用的应用。
sc.pp.regress_out(adata, ['n_counts', 'percent_mito'])
/Users/shinianyike/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
from pandas.core import datetools
Scale each gene to unit variance. Clip values exceeding standard deviation 10.
sc.pp.scale(adata, max_value=10)
Step2, 主成分分析
主成分分析是一种将数据降维的分析方法,是考察多个变量间相关性一种多元统计方法,研究如何通过少数几个主成分来揭示多个变量间的内部结构,即从原始变量中导出少数几个主成分,使它们尽可能多地保留原始变量的信息,且彼此间互不相关.通常数学上的处理就是将原来P个指标作线性组合,作为新的综合指标。
sc.tl.pca(adata, svd_solver='arpack') # PCA分析
sc.pl.pca(adata, color='CST3') #绘图
作碎石图,观测主成分的质量。这个图用于选择后续应该使用多少个PC,用于计算细胞间的相邻距离。
sc.pl.pca_variance_ratio(adata, log=True)
在这里先将数据保存,便于后续操作的更改。
adata.write("pca_results.h5ad")
Step3, 聚类分析
计算细胞间的距离:
这里的参数就先按照默认值设定:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
参数说明:
n_neighbors指的是每个点的邻近点的数量,据评论区@小光amateur 所说neighbors的个数越多,聚类数会越少。
pc的数量依赖于上面所做的碎石图,一般是选在拐点处的的主成分,只需要一个粗略值,不同的pc数量所产生的聚类形状也不同。我后来更改为PC=16,效果比下图要好一些。
将距离嵌入图中:
sc.tl.umap(adata)
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)
聚类
sc.tl.louvain(adata)
sc.pl.umap(adata, color=['louvain'])
这里得到了29类细胞,每个颜色代表一种。
将数据保存。
adata.write("umap.h5ad")
寻找marker基因
marker基因通常是细胞表面抗原,用于定义出该细胞的细胞类型。
为了定义每个簇属于什么细胞,根据基因的差异表达水平,将每个簇排名前25的基因导出。
sc.tl.rank_genes_groups(adata, 'louvain', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
Preprosessing the data
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
scanpy==1.4 anndata==0.6.19 numpy==1.14.5 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
adata = sc.read_h5ad("/bone_marrow/scanpy/3_29_PC16_filterMore/umap_tsne_3_29.h5ad")
sc.tl.draw_graph(adata)
drawing single-cell graph using layout "fa"
finished (1:06:05.80) --> added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)
sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")
Denoising the graph(will skip it next time!)
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X_diffmap')
computing Diffusion Maps using n_comps=15(=n_dcs)
eigenvalues of transition matrix
[1. 0.99998933 0.9999825 0.9999806 0.9999773 0.99997413
0.99997026 0.999969 0.99996084 0.9999516 0.9999409 0.9999385
0.9999321 0.99992156 0.9999118 ]
finished (0:11:57.51) --> added
'X_diffmap', diffmap coordinates (adata.obsm)
'diffmap_evals', eigenvalues of transition matrix (adata.uns)
computing neighbors
finished (0:01:30.84) --> added to `.uns['neighbors']`
'distances', distances for each pair of neighbors
'connectivities', weighted adjacency matrix
sc.tl.draw_graph(adata)
drawing single-cell graph using layout "fa"
finished (1:05:24.51) --> added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)
sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")
..didn't see any denoising effect
PAGA
Annotate the clusters using marker genes.
sc.tl.paga(adata, groups='louvain')
running PAGA
finished (0:00:13.69) --> added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns)
sc.pl.paga(adata, color=['louvain'],title = "")
--> added 'pos', the PAGA positions (adata.uns['paga'])
sc.pl.paga(adata, color=['CD34', 'GYPB', 'MS4A1', 'IL7R'])
--> added 'pos', the PAGA positions (adata.uns['paga'])
Annote groups with cell type
adata.obs['louvain'].cat.categories
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
'13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23'],
dtype='object')
adata.obs['louvain_anno'] = adata.obs['louvain']
# annote them with names
adata.obs['louvain_anno'].cat.categories = ['0/T', '1/B', '2', '3/T', '4/MDDC', '5', '6/MDDC', '7/NK', '8/MDDC', '9/CD8+T', '10/NK', '11/B', '12/NRBC',
'13', '14/CD1C-CD141-DC', '15/pDC', '16/Macro,DC', '17/pDC', '18/DC', '19/transB,Plasmab', '20','21','22/B,NK','23']
Cluster | Cell Type | Marker Gene |
---|---|---|
0 | T cell/IL-17Ralpha T cell | IL7R, CD3E, CD3D |
1 | B cell | MS4A1, CD79A |
2 | 高表达核糖体蛋白基因 | |
3 | CD8+ T cell, T helper, angiogenic T cell | CD3E, CXCR4, CD3D, CCL5, GZMK |
4 | Monocyte derived dendritic cell | S100A8, S100A9 |
5 | 高表达核糖体蛋白基因 | |
6 | Monocyte derived dendritic cell | S100A8, S100A9 |
7 | NK cell | PRF1, NKG7, KLRB1, KLRD1 |
8 | Monocyte derived dendritic cell | S100A8, S100A9 |
9 | CD8+ T cell | GZMK, CD3D, CD8A, NKG7 * |
10 | NK Cell | GNLY, NKG7, PTPRC |
11 | B cell | CD24, CD79A, CD37, CD79B |
12 | Red blood cell(Erythrocyte) | HBB, HBA1,GYPA |
13 | not known | |
14 | CD1C-CD141- dendritic cell | FCGR3A, CST3 |
15 | Plasmacytiod dendritic cell | HSP90B1, SSR4, PDIA4, SEC11C, MZB1, UBE2J1, FKBP2, DERL3, HERPUD1, ITM2C |
16 | Macrophage/ dendritic cell | LYZ, HLA-DQA1, AIF1, CD74, FCER1A, CST3 |
17 | Plasmacytiod dendritic cell | IRF8, TCF4, LILRA4 * |
18 | Megakaryocyte progenitor cell/Megakaryocyte | PF4, PPBP, / GP9 |
19 | transitional B cell / Plasmablast | CD24, CD79B |
20 | not known | |
21 | B cell | MS4A1, CD79A, CD37, CD74 |
22 | B cell , NK cell | CD74,CD79A, NKG7, GZMH |
23 | not known |
上面这个大家看看就好,我自己也不确定,请自行翻阅文献!!!
sc.tl.paga(adata, groups='louvain_anno')
running PAGA
finished (0:00:13.55) --> added
'paga/connectivities', connectivities adjacency (adata.uns)
'paga/connectivities_tree', connectivities subtree (adata.uns)
sc.pl.paga(adata, threshold=0.03)
--> added 'pos', the PAGA positions (adata.uns['paga'])
adata
AnnData object with n_obs × n_vars = 315509 × 1314
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'louvain_anno'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'draw_graph', 'diffmap_evals', 'paga', 'louvain_sizes', 'louvain_anno_sizes', 'louvain_anno_colors'
obsm: 'X_pca', 'X_umap', 'X_tsne', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
sc.tl.draw_graph(adata, init_pos='paga')
drawing single-cell graph using layout "fa"
finished (1:03:55.77) --> added
'X_draw_graph_fa', graph_drawing coordinates (adata.obsm)
Add pesudotime parameters
# the most primitive cell is refered as 0 persudotime.
# Group 13 is the nearest cell population to Hematopoietic stem cell.
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno'] == '13')[0]
sc.tl.dpt(adata)
computing Diffusion Pseudotime using n_dcs=10
finished (0:00:00.04) --> added
'dpt_pseudotime', the pseudotime (adata.obs)
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
legend_loc='right margin',title = ['','pseudotime'])
sc.pl.draw_graph(adata, color=['louvain_anno'],
legend_loc='right margin',title = ['']) #plot again to see full legends info
try other "iroot" setting
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno'] == '5')[0]
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
legend_loc='right margin',title = ['','pseudotime'])
computing Diffusion Pseudotime using n_dcs=10
finished (0:00:00.04) --> added
'dpt_pseudotime', the pseudotime (adata.obs)
Several other cell types are chosen to be "root" for diffusion pseudotime, however the pseudotime graphs look no big different.
..it doesn't look meaningful. didn't see any trajectory to describe cell development.
I think the "denoising graph" step is to blame. Will skip it next time.
Otherwise i should zoom it into a specific cell population, but have no idea which kind of cell i should choose...
Beautify the graphs
Choose the colors of the clusters a bit more consistently.
pl.figure(figsize=(8, 2))
for i in range(28):
pl.scatter(i, 1, c=sc.pl.palettes.zeileis_26[i], s=200)
pl.show()
zeileis_colors = np.array(sc.pl.palettes.zeileis_26)
new_colors = np.array(adata.uns['louvain_anno_colors'])
new_colors[[13]] = zeileis_colors[[12]] # Stem(?) colors / green
new_colors[[12]] = zeileis_colors[[5]] # Ery colors / red
new_colors[[4,6,8,15,17]] = zeileis_colors[[17,17,17,16,16]] # monocyte derived dendritic cell and pDC/ yellow
new_colors[[14,16,18]] = zeileis_colors[[16,16,16]] # DC / yellow
new_colors[[0,3,9]] = zeileis_colors[[6,6,6]] # T cell / light blue
new_colors[[7,10]] = zeileis_colors[[0,0]] # NK cell / dark blue
new_colors[[1,11,22,19]] = zeileis_colors[[22,22,22,21]] # B cell / pink
new_colors[[21,23,20]] = zeileis_colors[[25,25,25]] # Not known / grey
new_colors[[2, 5]] = zeileis_colors[[25, 25]] # outliers / grey
adata.uns['louvain_anno_colors'] = new_colors
adata
AnnData object with n_obs × n_vars = 315509 × 1314
obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'louvain_anno', 'dpt_pseudotime'
var: 'gene_ids', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca', 'draw_graph', 'diffmap_evals', 'paga', 'louvain_sizes', 'louvain_anno_sizes', 'louvain_anno_colors', 'iroot'
obsm: 'X_pca', 'X_umap', 'X_tsne', 'X_draw_graph_fa', 'X_diffmap'
varm: 'PCs'
sc.pl.draw_graph(adata, color=['louvain_anno'],
legend_loc='right margin',title = [''])
this is a piece of shit.(翻车了 ,哈哈)
<article class="_2rhmJa" style="box-sizing: border-box; display: block; font-weight: 400; line-height: 1.8; margin-bottom: 20px; color: rgb(64, 64, 64); font-family: -apple-system, BlinkMacSystemFont, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Segoe UI", "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">
与上次翻车实验的不同:
- 过滤了更多的细胞和基因。在处理数据时,低质量的细胞一定要清除掉,告诫大家宁缺毋滥。。。否则后续分析真的是一堆的噪点。
- 跳过了scanpy中的降噪步骤。scanpy的降噪算法做的真不咋地,一个好图降噪之后更加混乱了。这个故事告诉我们一个道理:不要迷信作者说的话!!!
对翻车实验感兴趣的话:https://www.jianshu.com/p/e688646a451b
什么是轨迹推断/伪时间分析?
轨迹推断对应英文是trajectory inference,伪时间分析对应英文是pseudotime analysis。其实它们做的是同一件事。
在单细胞转录组测序完成的时候,无论是细胞在发育、分化还是凋亡的过程中,它们的生命活动已经在这一时刻静止了,因此我们可以将这些测序数据看作是这一个时刻细胞呈现的表达状态的瞬时截图。
而轨迹推断/伪时间分析则构建了一个细胞发育和分化的模型。这一个瞬时截图内的细胞在各种各样不同的发育状态,有的发育早,有的发育晚,有的分化了,有的未分化,甚至有的处于中间态。而决定细胞分化往往就是基因表达的变化。
轨迹推断利用各类细胞基因表达的连续性建立了一个“分化轨迹”,再将这些细胞投射到这个“分化轨迹”上。如此一来,这个静态的截图就呈现出了动态的过程。
PS:细胞类型分类真是一场艰苦卓绝的战斗!!!一入文献深似海!!!!
0. 数据预处理
先加载需要的包:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
版本信息:
scanpy==1.4 anndata==0.6.19 numpy==1.14.5 scipy==1.1.0 pandas==0.23.4 scikit-learn==0.19.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1
读取数据(数据来源详见https://www.jianshu.com/p/b190efae4d31)
这个数据是利用scanpy进行聚类后的对象。
adata = sc.read_h5ad("/data1/zll/project/deepBase3/HCA/bone_marrow/scanpy/3_29_PC16_filterMore/umap_tsne_3_29.h5ad")
预处理数据,计算距离并可视化。
sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color='louvain', legend_loc='on data',title = "")
1. Partition-based Graph Abstraction(PAGA)
这是一种基于空间划分的抽提细胞分化“骨架”的一种算法,用于显示细胞的分化轨迹,得到哪些cluster之间的关系更接近。
sc.tl.paga(adata, groups='louvain')
sc.pl.paga(adata, color=['louvain'],title = "")
AVP是造血干细胞表达的一个基因,给这个基因上色,我们可以看到基本上都富集在了cluster13的位置。那么有很大可能这个cluster就是造血干细胞。
sc.pl.paga(adata, color=['AVP'])
2.注释细胞类型
上面的每个cluster都是用数字编号代替,为了能更清楚地知道这些细胞之间的关系,将细胞类型的信息注释上去。细胞类型的确定主要利用了marker基因以及现有的文献和资料,是一个非常复杂的过程。总之在与文献的一番斗争之后,找到了以下这些骨髓的细胞类型(仅供参考):
Cluster | Cell Type | Marker Gene |
---|---|---|
0 | naive T cell | CD3E, CD3D, RPS29 |
1 | B cell | MS4A1, CD79A |
2 | naive T cell | RPL34,RPS6 |
3 | CD8+ T cell | CCL5, GZMK, IL32 |
4 | Neutrophil | S100A8, S100A9,VCAN |
5 | naive T cell | RPL32,RPL13 |
6 | Neutrophil | FTL, S100A8, S100A9 |
7 | NK cell | PRF1, NKG7, KLRB1, KLRD1 |
8 | Eosinophil | LYZ, LGALS1,MT-CO3 |
9 | CD8+ T cell | GZMK, CD3D, CD8A, NKG7 |
10 | Eosinophil | MT-CO2, MT-CYB,MT-CO3 |
11 | CD34+ pre-B | CD79B, VPREB3 |
12 | Nuceated Red blood cell(Erythroblast) | HBB, HBA1,AHSP |
13 | CD34+ CMP,CD34+ HSC,Early-ERP | HNRNPA1, BTF3,GNB2L1,NPM1 |
14 | Monocyte | FCGR3A, LST1, AIF1 |
15 | Plasma Cell | MZB1, SSR4, FKBP11 |
16 | pre-dendritic cell | CST3, HLA-DPA1,FCER1A |
17 | dendritic cell | ITM2C, SEC61B,JCHAIN |
18 | Platelet | PF4, PPBP, GP9 |
19 | pro-B cell | HMGB1, IGLL1,STMN1 |
20 | Stromal cell | AOX1, SEPP1 |
21 | Eosinophil | MT-CO2, MT-ND3,MT-CO3, |
22 | B cell , NK cell (mixture,not sure) | CD74,CD79A, NKG7, GZMH |
23 | pro-B | STMN1, TUBA1B |
下面进行注释:
adata.obs['louvain'].cat.categories
返回共有24个cluster:
Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
'13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23'],
dtype='object')
将编号所对应的细胞类型注释上去:
adata.obs['louvain_anno'] = adata.obs['louvain']
# annote them with names
adata.obs['louvain_anno'].cat.categories = ['0/nT', '1/B', '2nT', '3/CD8+T',
'4/Neu', '5/nT', '6/Neu', '7/NK',
'8/Eos', '9/CD8+T', '10/Eos', '11/pre-B',
'12/NRBC','13/HSC', '14/Mon', '15/plasma',
'16/preDC', '17/DC', '18/plt', '19/pro-B',
'20/strom','21/Eos','22/','23/pro-B']
3. 计算 PAGA并可视化
sc.tl.paga(adata, groups='louvain_anno')
sc.pl.paga(adata, threshold=0.03)
于是就得到了这样像“骨架”一样的结果。这个图显示的就是细胞与细胞之间的关系,距离越近表示关系越接近。
4. 利用PAGA重新计算细胞之间的距离
还记得我们第0步计算的距离吗?现在我们要将细胞在PAGA这个“骨架”上重现出来,就利用PAGA的计算结果,把细胞放上去。
sc.tl.draw_graph(adata, init_pos='paga')
sc.pl.draw_graph(adata, color=['louvain_anno'], legend_loc='right margin')
5.(可选)美化图片
Choose the colors of the clusters a bit more consistently.
pl.figure(figsize=(8, 2))
for i in range(28):
pl.scatter(i, 1, c=sc.pl.palettes.zeileis_26[i], s=200)
pl.show()
zeileis_colors = np.array(sc.pl.palettes.zeileis_26)
new_colors = np.array(adata.uns['louvain_anno_colors'])
new_colors[[13]] = zeileis_colors[[17]] # Stem colors / yellow
new_colors[[12]] = zeileis_colors[[5]] # Ery colors / red
new_colors[[4,6]] = zeileis_colors[[12,12]] # Neutrophil / green
new_colors[[8,10]] = zeileis_colors[[11,11]] # Eosinophil / light red
new_colors[[14]] = zeileis_colors[[19]] # monocyte / light green
new_colors[[16,17]] = zeileis_colors[[18,26]] # DC / yellow
new_colors[[0,2,3,5,9]] = zeileis_colors[[7,7,6,7,6]] # naive T & CD8+T / light blue
new_colors[[7]] = zeileis_colors[[0]] # NK cell / dark blue
new_colors[[1,11,19,23]] = zeileis_colors[[23,22,9,9]] # B cell / pink
new_colors[[22]] = zeileis_colors[[25]] # Not known / grey
new_colors[[15]] = zeileis_colors[[3]] #plasma cell / blue grey
new_colors[[18]] = zeileis_colors[[4]] #platelet / pink grey
adata.uns['louvain_anno_colors'] = new_colors
And add some white space to some cluster names.
sc.pl.paga_compare(
adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,
legend_fontsize=12, fontsize=12, frameon=False, edges=True, save=True)
--> added 'pos', the PAGA positions (adata.uns['paga'])
saving figure to file ./figures/paga_compare.pdf
6. 计算伪时间值
首先需要定义一个伪时间值为0的细胞群体,一般来说就是干细胞或者祖细胞。总之就是最原始的细胞类型。
# the most primitive cell is refered as 0 persudotime.
# Group 13 is the nearest cell population to Hematopoietic stem cell.
adata.uns['iroot'] = np.flatnonzero(adata.obs['louvain_anno'] == '13/HSC')[0]
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color=['louvain_anno', 'dpt_pseudotime'],
legend_loc='right margin',title = ['','pseudotime'])
伪时间图的分化轨迹无法用色彩体现出来,可能是伪时间参数的问题。
上面的图注被遮住了,所以单独做一个图。
#plot again to see full legends info
sc.pl.draw_graph(adata, color=['louvain_anno'],
legend_loc='right margin',title = [''])
保存结果文件:
adata.write("/data1/zll/project/deepBase3/HCA/bone_marrow/scanpy/3_31_traj/trajectory_3_31.h5ad")
关于scanpy如何进行多样本整合数据分析,还望道友们不吝赐教!