Cell2location 是一个原则性的贝叶斯模型,它可以解析空间转录数据中的细胞类型,并创建不同组织的全面细胞图。Cell2location 解释了变异的技术来源,并借用了不同位置的统计强度,从而使单细胞和空间转录组的集成比现有工具具有更高的灵敏度和分辨率。这是通过估计哪种细胞类型的组合,其细胞丰度可以给出空间数据中的mRNA counts数,同时模拟技术效应(平台/技术效应,污染 RNA,无法解释的方差)来实现的。
Cell2location需要未转化、未标准化的空间mRNA counts数作为输入;还需要提供给Cell2location每个位置被期望的细胞丰度,这是用来作为一个估计完整细胞丰度的先导,这个数值取决于组织,可以通过计算配对组织学图像中少数部位的细胞核来估计,可以是近似的。
Cell2location提供了从 scRNA-seq 数据中估计参考细胞类型特征的两种方法:
- 一种基于负二项回归的统计方法。开发者通常推荐使用 NB 回归,它允许将跨技术和批次处理的数据有力地结合起来,从而提高了空间映射的准确性。
- 硬编码计算个体基因的每簇平均 mRNA counts数(
cell2location.cluster_averages.compute_cluster_averages
)。当批次效应很小时,这种计算每个集群平均值的更快的硬编码方法提供了同样高的准确性。我们还建议对于非 UMI 技术使用硬编码方法,如 Smart-Seq 2。
加载需要的包
import sys
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB and branch == "stable":
!pip install --quiet git+https://github.com/BayraktarLab/cell2location#egg=cell2location[tutorials]
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import cell2location
import scvi
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
import seaborn as sns
#定义分析结果保存在哪
results_folder = './results/lymph_nodes_analysis/'
# create paths and names to results folders for reference regression and cell2location models
ref_run_name = f'{results_folder}/reference_signatures'
run_name = f'{results_folder}/cell2location_map'
加载 Visium 和 scRNA-seq 参考数据
#通过scanpy导入公开发表的淋巴结空间转录组数据
adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]
# rename genes to ENSEMBL
adata_vis.var['SYMBOL'] = adata_vis.var_names#_names:提取行名
adata_vis.var_names = adata_vis.var['gene_ids']
adata_vis.var_names.name = None#_names.name = None将名称设为无
注:线粒体编码基因(基因名称以 MT 或 MT-开头)对空间定位无关,因为它们的表达代表了单个细胞和细胞核数据中的技术性假象,而不是线粒体的生物丰度。然而,这些基因在每个位置的mRNA中占15-40% 。因此,为了避免绘制伪影,强烈建议移除线粒体基因。
# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]
# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()#传递了稀疏矩阵,但需要密集数据。使用X.toarray()转换为密集的numpy数组
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]
已发表的淋巴结 scRNA-seq 数据集通常缺乏足够的代表性的生发中心相关的免疫细胞群体,由于患者捐赠者的年龄。因此,我们将涵盖淋巴结、脾脏和扁桃体的 scRNA-seq 数据集包含在单细胞参考数据中,以确保在空间转录数据集中能够捕捉到可能存在的免疫细胞状态的全部多样性。
在这里,我们下载这个数据集,导入到anndata并改变变量名称为 ENSEMBL 基因标识符。
# Download data if not already here
import os
if not os.path.exists('./data/sc.h5ad'):
!cd ./data/ && wget https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad
# Read data
adata_ref = sc.read(f'./data/sc.h5ad')
# Use ENSEMBL as gene IDs to make sure IDs are unique and correctly matched
adata_ref.var['SYMBOL'] = adata_ref.var.index
adata_ref.var.index = adata_ref.var['GeneID-2'].copy()
adata_ref.var_names = adata_ref.var['GeneID-2'].copy()
adata_ref.var.index.name = None
adata_ref.raw.var['SYMBOL'] = adata_ref.raw.var.index
adata_ref.raw.var.index = adata_ref.raw.var['GeneID-2'].copy()
adata_ref.raw.var.index.name = None
# before we estimate the reference cell type signature we recommend to perform very permissive genes selection
# in this 2D histogram orange rectangle lays over excluded genes.
# In this case, the downloaded dataset was already filtered using this method,
# hence no density under the orange rectangle
from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)
# filter the object
adata_ref = adata_ref[:, selected].copy()
参考细胞类型特征的估计(NB 回归)
# prepare anndata for the regression model
scvi.data.setup_anndata(adata=adata_ref,
# 10X reaction / sample / batch
batch_key='Sample',
# cell type, covariate used for constructing signatures
labels_key='Subset',
# multiplicative technical effects (platform, 3' vs 5', donor effect)
categorical_covariate_keys=['Method']
)
scvi.data.view_anndata_setup(adata_ref)
# create and train the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)
# Use all data for training (validation not implemented yet, train_size=1)
mod.train(max_epochs=250, batch_size=2500, train_size=1, lr=0.002, use_gpu=True)
# plot ELBO loss history during training, removing first 20 epochs from the plot
mod.plot_history(20)
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_ref = mod.export_posterior(
adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': True}
)
# Save model
mod.save(f"{ref_run_name}", overwrite=True)
# Save anndata object with results
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)
adata_file
检查QC后的图
- 重建准确性去评估是否有任何组织存在问题
- 由于批次效应,估计的表达特征不同于每个集群中的平均表达。对于不受批次效应影响的 scRNA-seq 数据集(这个数据集) ,可以使用聚类平均表达式来代替使用模型估计特征。当这个图与对角线图非常不同时(例如 y 轴上的值非常低,密度无处不在) ,它表明特征估计存在问题。
mod.plot_QC()
#模型和输出 h5ad 可以像下面这样加载:
mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref = sc.read_h5ad(adata_file)
# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.iloc[0:5, 0:5]
Cell2location: 空间比对
# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()
# prepare anndata for cell2location model
scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
scvi.data.view_anndata_setup(adata_vis)
在这里,需要指定2个用户提供的超参数(
n_cells_per_location
和 detection_alpha
)注:虽然可以经常使用
detection_alpha
超参数的默认值,但是将预期的细胞丰度 n_cells per_location
调整到每个组织是有用的。这个值可以通过成对的组织学图像估计,如上面的注释所述。将本教程中提供的值(
n_cells_ per_location = 30
)更改为在您的组织中观察到的值。
# create and train the model
mod = cell2location.models.Cell2location(
adata_vis, cell_state_df=inf_aver,
# the expected average cell abundance: tissue-dependent
# hyper-prior which can be estimated from paired histology:
N_cells_per_location=30,
# hyperparameter controlling normalisation of
# within-experiment variation in RNA detection (using default here):
detection_alpha=200
)
mod.train(max_epochs=30000,
# train using full data (batch_size=None)
batch_size=None,
# use all data points in training because
# we need to estimate cell abundance at all locations
train_size=1,
use_gpu=True)
# plot ELBO loss history during training, removing first 100 epochs from the plot
mod.plot_history(1000)
plt.legend(labels=['full data training']);
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
)
# Save model
mod.save(f"{run_name}", overwrite=True)
# mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file
#模型和输出 h5ad 可以像下面这样加载:
mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
adata_file = f"{run_name}/sp.h5ad"
adata_vis = sc.read_h5ad(adata_file)
# Examine reconstruction accuracy to assess if there are any issues with mapping
# the plot should be roughly diagonal, strong deviations will signal problems
mod.plot_QC()
当整合多个空间批次时和当处理载玻片中检测到的 RNA 差异很大的数据集时(这在组织学上无法用高细胞密度来解释) ,评估 cell2location 是否使这些影响恢复正常是很重要的。
期望看到的是不同批次中有相似的总细胞数和不同的RNA检测灵敏度(这都可以通过cell2location评估)。
期望组织学中的总细胞数量反映出高细胞密度。
Visualising cell abundance in spatial coordinates
# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']
# select one slide
from cell2location.utils import select_slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black','figure.figsize': [4.5, 5]}):
sc.pl.spatial(slide, cmap='magma',
# show first 8 cell types
color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC','B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
ncols=4, size=1.3,
img_key='hires',
# limit color scale at 99.2% quantile of cell abundance
vmin=0, vmax='p99.2',
save=True
)
# Now we use cell2location plotter that allows showing multiple cell types in one panel
from cell2location.plt import plot_spatial
# select up to 6 clusters
clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
import matplotlib.pyplot as plt
with mpl.rc_context({'figure.figsize': (15, 15)}):
fig = plot_spatial(
adata=slide,
# labels to show on a plot
color=clust_col, labels=clust_labels,
show_img=True,
# 'fast' (white background) or 'dark_background'
style='fast',
# limit color scale at 99.2% quantile of cell abundance
max_color_quantile=0.992,
# size of locations (adjust depending on figure size)
circle_diameter=6,
colorbar_position='right'
)
plt.savefig('showing_multiple_cell_types_in_one_panel.pdf')
下游分析
基于 Leiden 聚类的离散组织区域识别
我们通过使用由cell2location估计的细胞丰度来确定细胞组成不同的组织区域。
我们通过估计每种细胞类型的细胞丰度来聚集 Visium 点,从而找到组织区域。
首先构造了一个 k 邻近的 neigbour (KNN)图表示细胞丰度估计值中位置的相似性,然后应用 Leiden 聚类算法进行聚类。邻近区域的数量应适应数据集的大小和解剖学定义区域的大小(例如海马区域相当小,因此可能被大n_neighbors
所掩盖)。这可以做一个范围的 KNN 近邻和Leiden聚类分辨率,直到一个聚类匹配的组织解剖结构获得。
聚类是跨所有 Visium 部分/批次共同完成的,因此区域身份是直接可比的。当多个批次之间存在强烈的技术效应时(这里不是这种情况) ,sc.external.pp.bbknn
原则上可以用来解释 KNN 构建过程中的这些效应。
生成的集群保存在 adata _ vis. obs [‘ region _ cluster’]
中
# compute KNN using the cell2location output stored in adata.obsm
sc.pp.neighbors(adata_vis, use_rep='q05_cell_abundance_w_sf',
n_neighbors = 15)
# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1.1)
# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"].astype("category")
# compute UMAP using KNN graph based on the cell2location output
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)
# show regions in UMAP coordinates
with mpl.rc_context({'axes.facecolor': 'white',
'figure.figsize': [8, 8]}):
sc.pl.umap(adata_vis, color=['region_cluster'], size=30,
color_map = 'RdPu', ncols = 2, legend_loc='on data',
legend_fontsize=20)
sc.pl.umap(adata_vis, color=['sample'], size=30,
color_map = 'RdPu', ncols = 2,
legend_fontsize=20)
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black',
'figure.figsize': [4.5, 5]}):
sc.pl.spatial(adata_vis, color=['region_cluster'],
size=1.3, img_key='hires', alpha=0.5)
利用基质因子分解(NMF)识别细胞区间/组织区域
在这里,我们使用 cell2location 映射结果来确定细胞类型的空间共存,以便更好地理解组织组织和预测细胞相互作用。我们用cell2location进行了细胞类型丰度的非负矩阵分解估计。与将 NMF 应用于常规 scRNA-seq 的有点相似,加性 NMF 分解产生了一组空间细胞类型丰度分布图,这些分布图可以捕捉共同定位的细胞类型。这种基于NMF 的分解自然地解释了多种细胞类型和微环境可以在相同的 Visium 位置共存的事实 ,同时跨组织区域(例如个体生发中心)共享信息。
提示在实际应用中,最好对一系列因子 R = 5,..,30进行 NMF 训练。并选择 R作为捕获细小组织区域和分裂已知部分之间的平衡。如果你想找到一些最独特的细胞区间,使用小一点的R。如果你想找到非常强的同位信号,并假设大多数细胞类型不同位,使用大一点的R。
下面我们将展示如何执行此分析。为了帮助这个分析,我们把这个分析包装成一个管道,这个管道可以自动训练不同R的 NMF 模型:
from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
adata_vis,
model_name='CoLocatedGroupsSklearnNMF',
train_args={
'n_fact': np.arange(11, 13), # IMPORTANT: use a wider range of the number of factors (5-30)
'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
'n_restarts': 3 # number of training restarts
},
export_args={'path': f'{run_name}/CoLocatedComb/'}
)
对于每个factor number,模型生成以下文件夹输出列表:
cell_type_fractions_heatmap/
:a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns)cell_type_fractions_mean/
:the data used for dot plotfactor_markers/
: tables listing top 10 cell types most speficic to each NMF factormodels/
:saved NMF modelspredictive_accuracy/
:2D histogram plot showing how well NMF explains cell2location outputspatial/
:NMF weights across locatinos in spatial coordinateslocation_factors_mean/
:the data used for the plot in spatial coordiantesstability_plots/
:stability of NMF weights between training restarts要检查的关键输出是
cell _ type _ fraction _ heatmap/
中的文件,其中显示了 NMF 组件(列)中对应于细胞间隔的各种类型(行)的估计 NMF 权重的点图。显示的是相对权重,每个单元类型的组件之间的规范化。
# Here we plot the NMF weights (Same as saved to `cell_type_fractions_heatmap`)
res_dict['n_fact12']['mod'].plot_cell_type_loadings()
高级应用
估计空间数据中每个基因的细胞类型特异性表达
为此,我们采用了由 Cable 等人提出的估计条件期望表达式的方法。使用 cell2location,我们可以查看后验概率分布,而不仅仅是细胞型特定表达式的估计值
# Compute expected expression per cell type
expected_dict = mod.module.model.compute_expected_per_cell_type(
mod.samples["post_sample_q05"], mod.adata
)
# Add to anndata layers
for i, n in enumerate(mod.factor_names_):
adata_vis.layers[n] = expected_dict['mu'][i]
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file
# Look at cell type specific expression in spatial coordinates,
# Here we highlight CD3D, pan T-cell marker expressed by
# 2 subtypes of T cells in distinct locations but not expressed by co-located B cells
ctypes = ['T_CD4+_TfH_GC', 'T_CD4+_naive', 'B_GC_LZ']
genes = ['CD3D', 'CR2']
with mpl.rc_context({'axes.facecolor': 'black'}):
# select one slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
from tutorial_utils import plot_genes_per_cell_type
plot_genes_per_cell_type(slide, genes, ctypes);
与后验概率计算一起,计算任意分位数
# Get posterior distribution samples for specific variables
samples_w_sf = mod.sample_posterior(num_samples=1000, use_gpu=True, return_samples=True,
batch_size=2020,
return_sites=['w_sf', 'm_g', 'u_sf_mRNA_factors'])
# samples_w_sf['posterior_samples'] contains 1000 samples as arrays with dim=(num_samples, ...)
samples_w_sf['posterior_samples']['w_sf'].shape
# Compute any quantile of the posterior distribution
medians = mod.posterior_quantile(q=0.5, use_gpu = True)
with mpl.rc_context({'axes.facecolor': 'white',
'figure.figsize': [5, 5]}):
plt.scatter(medians['w_sf'].flatten(), mod.samples['post_sample_means']['w_sf'].flatten());
plt.xlabel('median');
plt.ylabel('mean');
参考:https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html