cell2location学习笔记

Cell2location 是一个原则性的贝叶斯模型，它可以解析空间转录数据中的细胞类型，并创建不同组织的全面细胞图。Cell2location 解释了变异的技术来源，并借用了不同位置的统计强度，从而使单细胞和空间转录组的集成比现有工具具有更高的灵敏度和分辨率。这是通过估计哪种细胞类型的组合，其细胞丰度可以给出空间数据中的mRNA counts数，同时模拟技术效应(平台/技术效应，污染 RNA，无法解释的方差)来实现的。
Cell2location需要未转化、未标准化的空间mRNA counts数作为输入；还需要提供给Cell2location每个位置被期望的细胞丰度，这是用来作为一个估计完整细胞丰度的先导，这个数值取决于组织，可以通过计算配对组织学图像中少数部位的细胞核来估计，可以是近似的。
Cell2location提供了从 scRNA-seq 数据中估计参考细胞类型特征的两种方法:

一种基于负二项回归的统计方法。开发者通常推荐使用 NB 回归，它允许将跨技术和批次处理的数据有力地结合起来，从而提高了空间映射的准确性。
硬编码计算个体基因的每簇平均 mRNA counts数(cell2location.cluster_averages.compute_cluster_averages)。当批次效应很小时，这种计算每个集群平均值的更快的硬编码方法提供了同样高的准确性。我们还建议对于非 UMI 技术使用硬编码方法，如 Smart-Seq 2。

image.png

加载需要的包

import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB and branch == "stable":
    !pip install --quiet git+https://github.com/BayraktarLab/cell2location#egg=cell2location[tutorials]

import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

import cell2location
import scvi

from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
import seaborn as sns
#定义分析结果保存在哪
results_folder = './results/lymph_nodes_analysis/'

# create paths and names to results folders for reference regression and cell2location models
ref_run_name = f'{results_folder}/reference_signatures'
run_name = f'{results_folder}/cell2location_map'

加载 Visium 和 scRNA-seq 参考数据

#通过scanpy导入公开发表的淋巴结空间转录组数据
adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]

# rename genes to ENSEMBL
adata_vis.var['SYMBOL'] = adata_vis.var_names#_names:提取行名
adata_vis.var_names = adata_vis.var['gene_ids']
adata_vis.var_names.name = None#_names.name = None将名称设为无

注：线粒体编码基因(基因名称以 MT 或 MT-开头)对空间定位无关，因为它们的表达代表了单个细胞和细胞核数据中的技术性假象，而不是线粒体的生物丰度。然而，这些基因在每个位置的mRNA中占15-40% 。因此，为了避免绘制伪影，强烈建议移除线粒体基因。

# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]

# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()#传递了稀疏矩阵，但需要密集数据。使用X.toarray（）转换为密集的numpy数组
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]

已发表的淋巴结 scRNA-seq 数据集通常缺乏足够的代表性的生发中心相关的免疫细胞群体，由于患者捐赠者的年龄。因此，我们将涵盖淋巴结、脾脏和扁桃体的 scRNA-seq 数据集包含在单细胞参考数据中，以确保在空间转录数据集中能够捕捉到可能存在的免疫细胞状态的全部多样性。
在这里，我们下载这个数据集，导入到anndata并改变变量名称为 ENSEMBL 基因标识符。

# Download data if not already here
import os
if not os.path.exists('./data/sc.h5ad'):
    !cd ./data/ && wget https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad

# Read data
adata_ref = sc.read(f'./data/sc.h5ad')

# Use ENSEMBL as gene IDs to make sure IDs are unique and correctly matched
adata_ref.var['SYMBOL'] = adata_ref.var.index
adata_ref.var.index = adata_ref.var['GeneID-2'].copy()
adata_ref.var_names = adata_ref.var['GeneID-2'].copy()
adata_ref.var.index.name = None
adata_ref.raw.var['SYMBOL'] = adata_ref.raw.var.index
adata_ref.raw.var.index = adata_ref.raw.var['GeneID-2'].copy()
adata_ref.raw.var.index.name = None

# before we estimate the reference cell type signature we recommend to perform very permissive genes selection
# in this 2D histogram orange rectangle lays over excluded genes.
# In this case, the downloaded dataset was already filtered using this method,
# hence no density under the orange rectangle
from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)

# filter the object
adata_ref = adata_ref[:, selected].copy()

image.png

参考细胞类型特征的估计(NB 回归)

# prepare anndata for the regression model
scvi.data.setup_anndata(adata=adata_ref,
                        # 10X reaction / sample / batch
                        batch_key='Sample',
                        # cell type, covariate used for constructing signatures
                        labels_key='Subset',
                        # multiplicative technical effects (platform, 3' vs 5', donor effect)
                        categorical_covariate_keys=['Method']
                       )
scvi.data.view_anndata_setup(adata_ref)

image.png

# create and train the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)

# Use all data for training (validation not implemented yet, train_size=1)
mod.train(max_epochs=250, batch_size=2500, train_size=1, lr=0.002, use_gpu=True)

# plot ELBO loss history during training, removing first 20 epochs from the plot
mod.plot_history(20)

image.png

# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_ref = mod.export_posterior(
    adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': True}
)

# Save model
mod.save(f"{ref_run_name}", overwrite=True)

# Save anndata object with results
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)
adata_file

检查QC后的图

重建准确性去评估是否有任何组织存在问题
由于批次效应，估计的表达特征不同于每个集群中的平均表达。对于不受批次效应影响的 scRNA-seq 数据集(这个数据集) ，可以使用聚类平均表达式来代替使用模型估计特征。当这个图与对角线图非常不同时(例如 y 轴上的值非常低，密度无处不在) ，它表明特征估计存在问题。

mod.plot_QC()

image.png

#模型和输出 h5ad 可以像下面这样加载:
mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref = sc.read_h5ad(adata_file)

# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
    inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
    inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
                                    for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.iloc[0:5, 0:5]

image.png

Cell2location: 空间比对

# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()

# prepare anndata for cell2location model
scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
scvi.data.view_anndata_setup(adata_vis)

image.png

在这里，需要指定2个用户提供的超参数(n_cells_per_location 和 detection_alpha)
注:虽然可以经常使用 detection_alpha 超参数的默认值，但是将预期的细胞丰度 n_cells per_location调整到每个组织是有用的。
这个值可以通过成对的组织学图像估计，如上面的注释所述。将本教程中提供的值(n_cells_ per_location = 30)更改为在您的组织中观察到的值。

# create and train the model
mod = cell2location.models.Cell2location(
    adata_vis, cell_state_df=inf_aver,
    # the expected average cell abundance: tissue-dependent
    # hyper-prior which can be estimated from paired histology:
    N_cells_per_location=30,
    # hyperparameter controlling normalisation of
    # within-experiment variation in RNA detection (using default here):
    detection_alpha=200
)

mod.train(max_epochs=30000,
          # train using full data (batch_size=None)
          batch_size=None,
          # use all data points in training because
          # we need to estimate cell abundance at all locations
          train_size=1,
          use_gpu=True)

# plot ELBO loss history during training, removing first 100 epochs from the plot
mod.plot_history(1000)
plt.legend(labels=['full data training']);

image.png

# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
    adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
)

# Save model
mod.save(f"{run_name}", overwrite=True)

# mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)

# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file

image.png

#模型和输出 h5ad 可以像下面这样加载:
mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
adata_file = f"{run_name}/sp.h5ad"
adata_vis = sc.read_h5ad(adata_file)

# Examine reconstruction accuracy to assess if there are any issues with mapping
# the plot should be roughly diagonal, strong deviations will signal problems
mod.plot_QC()

image.png

当整合多个空间批次时和当处理载玻片中检测到的 RNA 差异很大的数据集时(这在组织学上无法用高细胞密度来解释) ，评估 cell2location 是否使这些影响恢复正常是很重要的。
期望看到的是不同批次中有相似的总细胞数和不同的RNA检测灵敏度（这都可以通过cell2location评估）。
期望组织学中的总细胞数量反映出高细胞密度。

Visualising cell abundance in spatial coordinates

# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']

# select one slide
from cell2location.utils import select_slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor':  'black','figure.figsize': [4.5, 5]}):
    sc.pl.spatial(slide, cmap='magma',
                  # show first 8 cell types
                  color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC','B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
                  ncols=4, size=1.3,
                  img_key='hires',
                  # limit color scale at 99.2% quantile of cell abundance
                  vmin=0, vmax='p99.2',
save=True
                 )

image.png

# Now we use cell2location plotter that allows showing multiple cell types in one panel
from cell2location.plt import plot_spatial

# select up to 6 clusters
clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels

slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

import matplotlib.pyplot as plt
with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(
        adata=slide,
        # labels to show on a plot
        color=clust_col, labels=clust_labels,
        show_img=True,
        # 'fast' (white background) or 'dark_background'
        style='fast',
        # limit color scale at 99.2% quantile of cell abundance
        max_color_quantile=0.992,
        # size of locations (adjust depending on figure size)
        circle_diameter=6,
        colorbar_position='right'
    )
    plt.savefig('showing_multiple_cell_types_in_one_panel.pdf')

image.png

下游分析

基于 Leiden 聚类的离散组织区域识别

我们通过使用由cell2location估计的细胞丰度来确定细胞组成不同的组织区域。
我们通过估计每种细胞类型的细胞丰度来聚集 Visium 点，从而找到组织区域。
首先构造了一个 k 邻近的 neigbour (KNN)图表示细胞丰度估计值中位置的相似性，然后应用 Leiden 聚类算法进行聚类。邻近区域的数量应适应数据集的大小和解剖学定义区域的大小(例如海马区域相当小，因此可能被大n_neighbors所掩盖)。这可以做一个范围的 KNN 近邻和Leiden聚类分辨率，直到一个聚类匹配的组织解剖结构获得。
聚类是跨所有 Visium 部分/批次共同完成的，因此区域身份是直接可比的。当多个批次之间存在强烈的技术效应时(这里不是这种情况) ，sc.external.pp.bbknn 原则上可以用来解释 KNN 构建过程中的这些效应。
生成的集群保存在 adata _ vis. obs [‘ region _ cluster’]中

# compute KNN using the cell2location output stored in adata.obsm
sc.pp.neighbors(adata_vis, use_rep='q05_cell_abundance_w_sf',
                n_neighbors = 15)

# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1.1)

# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"].astype("category")

# compute UMAP using KNN graph based on the cell2location output
sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)

# show regions in UMAP coordinates
with mpl.rc_context({'axes.facecolor':  'white',
                     'figure.figsize': [8, 8]}):
    sc.pl.umap(adata_vis, color=['region_cluster'], size=30,
               color_map = 'RdPu', ncols = 2, legend_loc='on data',
               legend_fontsize=20)
    sc.pl.umap(adata_vis, color=['sample'], size=30,
               color_map = 'RdPu', ncols = 2,
               legend_fontsize=20)

# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor':  'black',
                     'figure.figsize': [4.5, 5]}):
    sc.pl.spatial(adata_vis, color=['region_cluster'],
                  size=1.3, img_key='hires', alpha=0.5)

image.png

利用基质因子分解(NMF)识别细胞区间/组织区域

在这里，我们使用 cell2location 映射结果来确定细胞类型的空间共存，以便更好地理解组织组织和预测细胞相互作用。我们用cell2location进行了细胞类型丰度的非负矩阵分解估计。与将 NMF 应用于常规 scRNA-seq 的有点相似，加性 NMF 分解产生了一组空间细胞类型丰度分布图，这些分布图可以捕捉共同定位的细胞类型。这种基于NMF 的分解自然地解释了多种细胞类型和微环境可以在相同的 Visium 位置共存的事实，同时跨组织区域(例如个体生发中心)共享信息。
提示在实际应用中，最好对一系列因子 R = 5，..，30进行 NMF 训练。并选择 R作为捕获细小组织区域和分裂已知部分之间的平衡。如果你想找到一些最独特的细胞区间，使用小一点的R。如果你想找到非常强的同位信号，并假设大多数细胞类型不同位，使用大一点的R。
下面我们将展示如何执行此分析。为了帮助这个分析，我们把这个分析包装成一个管道，这个管道可以自动训练不同R的 NMF 模型:

from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
    adata_vis,
    model_name='CoLocatedGroupsSklearnNMF',
    train_args={
      'n_fact': np.arange(11, 13), # IMPORTANT: use a wider range of the number of factors (5-30)
      'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
      'n_restarts': 3 # number of training restarts
    },
    export_args={'path': f'{run_name}/CoLocatedComb/'}
)

image.png

对于每个factor number，模型生成以下文件夹输出列表:
cell_type_fractions_heatmap/:a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns)
cell_type_fractions_mean/:the data used for dot plot
factor_markers/: tables listing top 10 cell types most speficic to each NMF factor
models/:saved NMF models
predictive_accuracy/:2D histogram plot showing how well NMF explains cell2location output
spatial/:NMF weights across locatinos in spatial coordinates
location_factors_mean/:the data used for the plot in spatial coordiantes
stability_plots/:stability of NMF weights between training restarts
要检查的关键输出是 cell _ type _ fraction _ heatmap/中的文件，其中显示了 NMF 组件(列)中对应于细胞间隔的各种类型(行)的估计 NMF 权重的点图。显示的是相对权重，每个单元类型的组件之间的规范化。

# Here we plot the NMF weights (Same as saved to `cell_type_fractions_heatmap`)
res_dict['n_fact12']['mod'].plot_cell_type_loadings()

image.png

高级应用

估计空间数据中每个基因的细胞类型特异性表达

为此，我们采用了由 Cable 等人提出的估计条件期望表达式的方法。使用 cell2location，我们可以查看后验概率分布，而不仅仅是细胞型特定表达式的估计值

# Compute expected expression per cell type
expected_dict = mod.module.model.compute_expected_per_cell_type(
    mod.samples["post_sample_q05"], mod.adata
)

# Add to anndata layers
for i, n in enumerate(mod.factor_names_):
    adata_vis.layers[n] = expected_dict['mu'][i]

# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file

image.png

# Look at cell type specific expression in spatial coordinates,
# Here we highlight CD3D, pan T-cell marker expressed by
# 2 subtypes of T cells in distinct locations but not expressed by co-located B cells
ctypes = ['T_CD4+_TfH_GC', 'T_CD4+_naive', 'B_GC_LZ']
genes = ['CD3D', 'CR2']

with mpl.rc_context({'axes.facecolor':  'black'}):
    # select one slide
    slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')

    from tutorial_utils import plot_genes_per_cell_type
    plot_genes_per_cell_type(slide, genes, ctypes);

image.png

与后验概率计算一起，计算任意分位数

# Get posterior distribution samples for specific variables
samples_w_sf = mod.sample_posterior(num_samples=1000, use_gpu=True, return_samples=True,
                                    batch_size=2020,
                                    return_sites=['w_sf', 'm_g', 'u_sf_mRNA_factors'])
# samples_w_sf['posterior_samples'] contains 1000 samples as arrays with dim=(num_samples, ...)
samples_w_sf['posterior_samples']['w_sf'].shape

# Compute any quantile of the posterior distribution
medians = mod.posterior_quantile(q=0.5, use_gpu = True)

with mpl.rc_context({'axes.facecolor':  'white',
                     'figure.figsize': [5, 5]}):
    plt.scatter(medians['w_sf'].flatten(), mod.samples['post_sample_means']['w_sf'].flatten());
    plt.xlabel('median');
    plt.ylabel('mean');

image.png

参考：https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html

最后编辑于：2022.04.06 12:02:57

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 206,214评论 6赞 481
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 88,307评论 2赞 382
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 152,543评论 0赞 341
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 55,221评论 1赞 279
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 64,224评论 5赞 371
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 49,007评论 1赞 284
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,313评论 3赞 399
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,956评论 0赞 259
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 43,441评论 1赞 300
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,925评论 2赞 323
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 38,018评论 1赞 333
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,685评论 4赞 322
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,234评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 30,240评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,464评论 1赞 261
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,467评论 2赞 352
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,762评论 2赞 345

cell2location学习笔记

加载需要的包

加载 Visium 和 scRNA-seq 参考数据

参考细胞类型特征的估计(NB 回归)

检查QC后的图

Cell2location: 空间比对

Visualising cell abundance in spatial coordinates

下游分析

基于 Leiden 聚类的离散组织区域识别

利用基质因子分解(NMF)识别细胞区间/组织区域

高级应用

估计空间数据中每个基因的细胞类型特异性表达

与后验概率计算一起，计算任意分位数

推荐阅读更多精彩内容