单细胞转录组数据分析|| scVelo 教程：RNA速率分析工具

什么是RNA 速率（RNA velocity）

RNA速率能够通过叠加剪接信息来推断细胞分化的方向性。

随着令人震惊的发现，未剪接和剪接的mRNA丰度可以在标准的单细protocols胞中进行区分，La Manno等人(2018)引入了RNA速度(RNA velocity)的概念。通过将测量结果与潜在的mRNA剪接动力学联系起来，探索定向轨迹的推断：特定基因的转录诱导导致(新转录的)前体未剪接mRNA的增加，而相反，转录的抑制或缺失导致未剪接mRNA的减少。因此，通过将未剪接的mRNA与成熟的剪接mRNA进行区分，可以近似地得到mRNA丰度的变化，即其时间导数，即RNA速度。跨mrna的速度组合可以用来估计单个细胞的未来状态。细胞的运动是通过将速度投影到低维嵌入中来实现的。

scVelo如何估计RNA速度

RNA速度估计目前有三种方法:

稳态/确定性模型(在velocyto中使用)
随机模型(使用二阶矩)
动态模型(使用基于概率的框架)

velocyto中使用的稳态/确定性模型对速度的估计如下:在假定转录阶段(诱导和抑制)持续足够长的时间以达到稳态平衡(活跃和不活跃)的情况下，速度被量化为实际观测值如何偏离稳态平衡。平衡mRNA水平近似于在假定的上下分位数的稳定状态下的线性回归。这种简化是通过假设一个跨基因的通用剪接率和数据中反映的稳态mRNA水平来实现的。这可能导致速度估计和细胞状态的错误，因为这些假设经常被违背，特别是当一个种群包含多个异质亚种群动态时。

随机模型的目标是更好地捕捉稳态，但与稳态模型的假设相同。它是通过处理转录，剪接和降解作为概率事件，从而纳入二阶矩。也就是说，稳态水平不仅与mRNA水平近似，而且与内在表达变异性近似。

动态模型(最强大的，同时计算最耗资源)解决了每个基因的剪接动力学的全部动态。因此，它使RNA速度适应广泛变化的规格，如非平稳群体，因为它不依赖于限制一个共同的剪接率或稳态被采样。通过迭代估计反应速率和潜在细胞特异性变量的可识别参数，即转录状态和细胞内潜伏时间，在基于概率的期望最大化框架中求解剪接动力学。更准确地说，我们明确地模拟了四种转录状态，以考虑基因活动的所有可能配置：两个动态瞬态(诱导和抑制)和两个稳态(活跃和不活跃)在每个动态转变后可能达到的状态。对于每个状态下的每个观测值，计算一个模型的最优潜伏期，以获得一个映射到非剪接/剪接动态的学习曲线上。然后，细胞到曲线的映射产生各自状态的可能性，并通过最大可能的整体反应速率参数。

这产生了更一致的速度估计和更好的识别转录状态。该模型进一步能够以一种基于概率的方式系统地识别动态驱动基因，从而找到控制细胞命运转变的关键驱动因素。此外，动态模型推断了一个普遍的细胞内潜伏时间共享的基因，使相关基因和识别转录变化的机制(如分支点)。

为了得到最好的结果，我们显然推荐使用更优的动态模型。如果运行时很重要，我们建议使用随机模型，它是用来近似动态模型的。随机模型在30k细胞上用时不到一分钟。然而，动力仍然可以花一个小时，然而，提高效率正在进行中。

安装

#conda install -c conda-forge numba pytables louvain
pip install -U scvelo

## or # git clone git+https://github.com/theislab/scvelo
## pip install -e scvelo

比对

为了获得MRNA剪切与为剪切的信息，我们需要比对后的bam文件，比对可以采用不同的方法:

scVelo in action

如果我们已经熟悉scanpy的基本操作，理解scVelo 就不是难事了，因为二者的数据结构是一样的。

# -*- coding: utf-8 -*-
"""
Created on Fri Mar 27 08:34:49 2020
https://scvelo.readthedocs.io/
@author: Administrator
"""
import scvelo as scv
import scanpy as sc

scv.logging.print_version()
scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.set_figure_params('scvelo')  # for beautified visualization
scv.settings.set_figure_params('scvelo')


# filename = 'E:\learnscanpy\write\pbmc3k.h5ad'
# adata = scv.read(filename, cache=False)

loomf = 'E:\learnscanpy\data\possorted_genome_bam_AYDSF.loom'
adata = scv.read(loomf, cache=False)
adata.var_names_make_unique

这里的possorted_genome_bam_AYDSF.loom是我们10X的数据比对后用velocyto处理得到的loom文件。

我们来看看数据结构：

adata
Out[6]: 
AnnData object with n_obs × n_vars = 7292 × 33694 
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'

adata.var
Out[7]: 
                     Accession Chromosome       End     Start Strand
FAM138A        ENSG00000237613          1     36081     34554      -
RP11-34P13.7   ENSG00000238009          1    133723     89295      -
RP11-34P13.8   ENSG00000239945          1     91105     89551      -
RP11-34P13.14  ENSG00000239906          1    140339    139790      -
FO538757.2     ENSG00000279457          1    200322    184923      -
                       ...        ...       ...       ...    ...
AC006386.1     ENSG00000279115          Y  25308107  25307702      +
AC006328.4     ENSG00000280301          Y  25473714  25463994      +
CSPG4P1Y       ENSG00000240450          Y  25486705  25482908      +
CDY1           ENSG00000172288          Y  25624902  25622162      +
TTTY3          ENSG00000231141          Y  25733388  25728490      +

[33694 rows x 5 columns]

adata.layers['matrix']
Out[8]: 
<7292x33694 sparse matrix of type '<class 'numpy.float32'>'
    with 17201128 stored elements in Compressed Sparse Row format>

数据预处理

检测基因选择(用最少的计数检测)和高变异性(离散度)。-根据每个细胞的初始大小对其进行归一化，并对X取对数。


scv.pp.filter_genes(adata, min_shared_counts=10)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=3000)
scv.pp.log1p(adata)

此外，我们需要在PCA空间中计算最近邻的一阶和二阶矩(基本上是均值和无中心方差)。确定性速度估计需要一阶矩，而随机估计也需要二阶矩。

scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

adata
Out[15]: 
AnnData object with n_obs × n_vars = 7292 × 1999 
    obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm'
    uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

计算velocity 和velocity 图

通过拟合前体(未剪接的)和成熟(剪接的)mRNA丰度之间的比值来获得基因特异性速度，该比值很好地解释了稳定状态(恒定转录状态)，然后计算观察到的丰度如何偏离稳定状态下的期望。

scv.tl.velocity(adata)

computing velocities
    finished (0:00:01) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)

这计算了在高维空间中势元跃迁与速度矢量的(余弦)相关性。结果的velocity 图nobs×nobs，并总结了可能的细胞状态变化(从一个细胞到另一个细胞的过渡)，这些变化通过速度向量得到了很好的解释。如果你设置approx=True，它是在一个减少的PCA空间上计算的，有50个PC。

然后，通过在余弦相关上应用高斯核，将速度图转换为一个过渡矩阵，该高斯核赋予与速度向量相关的细胞状态变化的高概率。可以通过scv.tl.transition_matrix访问马尔可夫转移矩阵。由此产生的转换矩阵可用于以下所示的各种应用程序。例如，它通过简单地应用相对于转移概率的平均转移，即scv.tl. velocity_embedded，来将速度放入低维嵌入中。此外，通过scv.tl.terminal_states,我们可以沿着马尔科夫链追溯细胞的起源和潜在的命运，从而获得轨迹中的根细胞和终点。

scv.tl.velocity_graph(adata)

computing velocity graph

100%    finished (0:00:04) --> added 
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
Out[18]: 
AnnData object with n_obs × n_vars = 7292 × 1999 
    obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm', 'velocity_gamma', 'velocity_r2', 'velocity_genes'
    uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key', 'velocity_settings', 'velocity_graph', 'velocity_graph_neg'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu', 'velocity', 'variance_velocity'
    obsp: 'distances', 'connectivities'

可视化结果

在可视化之前我们先计算一下umap:

scv.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.leiden(adata,adjacency = adata.obsp['connectivities'])
# 如果是pip安装的，adjacency = adata.obsp['connectivities']就不要了。

sc.tl.leiden(adata)

vdata =  adata 
#adata.uns['neighbors']['indices']
#adata.uns['connectivities_key']
#adata.obsp['connectivities'].tocoo()
scv.tl.umap(adata)

scv.pl.velocity_embedding_stream(adata, basis='umap', color=['leiden','initial_size_spliced'])

computing velocity embedding
    finished (0:00:01) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)

基因排序

help(scv.tl.rank_velocity_genes) # 改造函数
Help on function rank_velocity_genes in module scvelo.tools.rank_velocity_genes:

rank_velocity_genes(data, vkey='velocity', n_genes=10, groupby=None, match_with=None, resolution=None, min_counts=None, min_r2=None, min_dispersion=None, min_likelihood=None, copy=False)
    Rank genes for velocity characterizing groups.
    
    This applies a differential expression test (Welch t-test with overestimated variance to be conservative) on
    velocity expression, to find genes in a cluster that show dynamics that is transcriptionally regulated differently
    compared to all other clusters (e.g. induction in that cluster and homeostasis in remaining population).
    If no clusters are given, it priorly computes velocity clusters by applying louvain modularity on velocity expression.
    
    .. code:: python
    
        scv.tl.rank_velocity_genes(adata, groupby='clusters')
        scv.pl.scatter(adata, basis=adata.uns['rank_velocity_genes']['names']['Beta'][:3])
        pd.DataFrame(adata.uns['rank_velocity_genes']['names']).head()
    
    .. image:: https://user-images.githubusercontent.com/31883718/69626017-11c47980-1048-11ea-89f4-df3769df5ad5.png
       :width: 600px
    
    .. image:: https://user-images.githubusercontent.com/31883718/69626572-30774000-1049-11ea-871f-e8a30c42f10e.png
       :width: 600px
    
    Arguments
    ----------
    data : :class:`~anndata.AnnData`
        Annotated data matrix.
    vkey: `str` (default: `'velocity'`)
        Key of velocities computed in `tl.velocity`
    n_genes : `int`, optional (default: 100)
        The number of genes that appear in the returned tables.
    groupby: `str`, `list` or `np.ndarray` (default: `None`)
        Key of observations grouping to consider.
    match_with: `str` or `None` (default: `None`)
        adata.obs key to separatively rank velocities on.
    resolution: `str` or `None` (default: `None`)
        Resolution for louvain modularity.
    min_counts: `float` (default: None)
        Minimum count of genes for consideration.
    min_r2: `float` (default: None)
        Minimum r2 value of genes for consideration.
    min_dispersion: `float` (default: None)
        Minimum dispersion norm value of genes for consideration.
    min_likelihood: `float` between `0` and `1` or `None` (default: `None`)
        Only rank velocity of genes with a likelihood higher than min_likelihood.
    copy: `bool` (default: `False`)
        Return a copy instead of writing to data.
    
    Returns
    -------
    Returns or updates `data` with the attributes
    rank_velocity_genes : `.uns`
        Structured array to be indexed by group id storing the gene
        names. Ordered according to scores.
    velocity_score : `.var`
        Storing the score for each gene for each group. Ordered according to scores.

scv.tl.rank_velocity_genes(adata, match_with='leiden', resolution=.4)

import pandas as pd
pd.DataFrame(adata.uns['rank_velocity_genes']['names']).head()

Out[27]: 
        0 0       1 1      4 2       7 3  ...      0 9      0 10  10 11      5 12
0      GNLY      NKG7      HCK      CCL5  ...     CD36  HLA-DQB1  PTGS1      LST1
1      MNDA     RGS18    SMCO4     CD160  ...    DMXL2  HLA-DQA1   CD22      CD22
2   CLEC10A   METTL17    DMXL2    SH2D1B  ...     CD74    CCDC50  RGS18    CCDC50
3  HLA-DPA1  HLA-DPA1  CYP27A1     RGS18  ...  CLEC10A   TSPAN13    MMD  HLA-DQB1
4   HLA-DRA     DDX11      FGR  HLA-DPA1  ...      MMD      CD22    FGR  HLA-DQA1

[5 rows x 13 columns]

scv.pl.velocity_embedding_grid(adata, color='GNLY',
                               layer=['velocity', 'spliced'], arrow_size=1.5)

特定基因的剪切状态，velocity动态以及表达情况，直接整理成可以发表的形式：

var_names = ['GNLY', 'RGS18', 'DMXL2', 'SH2D1B']
scv.pl.velocity(adata, var_names=var_names, colorbar=True, ncols=2)

help(scv.pl.velocity_graph)
scv.pl.velocity_graph(adata,color='GNLY')

动力学分析

 help(scv.tl.recover_dynamics)
Help on function recover_dynamics in module scvelo.tools.dynamical_model:

recover_dynamics(data, var_names='velocity_genes', n_top_genes=None, max_iter=10, assignment_mode='projection', t_max=None, fit_time=True, fit_scaling=True, fit_steady_states=True, fit_connected_states=None, fit_basal_transcription=None, use_raw=False, load_pars=None, return_model=None, plot_results=False, steady_state_prior=None, add_key='fit', copy=False, **kwargs)
    Recovers the full splicing kinetics of specified genes.
    
    The model infers transcription rates, splicing rates, degradation rates,
    as well as cell-specific latent time and transcriptional states, estimated iteratively by expectation-maximization.
    
    .. image:: https://user-images.githubusercontent.com/31883718/69636459-ef862800-1056-11ea-8803-0a787ede5ce9.png
    
    Arguments
    ---------
    data: :class:`~anndata.AnnData`
        Annotated data matrix.
    var_names: `str`,  list of `str` (default: `'velocity_genes`)
        Names of variables/genes to use for the fitting.
    n_top_genes: `int` or `None` (default: `None`)
        Number of top velocity genes to use for the dynamical model.
    max_iter:`int` (default: `10`)
        Maximal iterations in the EM-Algorithm.
    assignment_mode: `str` (default: `projection`)
        Determined how times are assigned to observations.
        If `projection`, observations are projected onto the model trajectory.
        Else uses an inverse approximating formula.
    t_max: `float`, `False` or `None` (default: `None`)
        Total range for time assignments.
    fit_scaling: `bool` or `float` or `None` (default: `True`)
        Whether to fit scaling between unspliced and spliced or keep initially given scaling fixed.
    fit_time: `bool` or `float` or `None` (default: `True`)
        Whether to fit time or keep initially given time fixed.
    fit_steady_states: `bool` or `None` (default: `True`)
        Allows fitting of observations to steady states next to repression and induction.
    fit_connected_states: `bool` or `None` (default: `None`)
        Restricts fitting to neighbors given by connectivities.
    fit_basal_transcription: `bool` or `None` (default: `None`)
        Enables model to incorporate basal transcriptions.
    use_raw: `bool` or `None` (default: `None`)
        if True, use .layers['sliced'], else use moments from .layers['Ms']
    load_pars: `bool` or `None` (default: `None`)
        Load parameters from past fits.
    return_model: `bool` or `None` (default: `True`)
        Whether to return the model as :DynamicsRecovery: object.
    plot_results: `bool` or `None` (default: `False`)
        Plot results after parameter inference.
    steady_state_prior: list of `bool` or `None` (default: `None`)
        Mask for indices used for steady state regression.
    add_key: `str` (default: `'fit'`)
        Key to add to parameter names, e.g. 'fit_t' for fitted time.
    copy: `bool` (default: `False`)
        Return a copy instead of writing to `adata`.
    
    Returns
    -------
    Returns or updates `adata`

scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

adata
Out[37]: 
AnnData object with n_obs × n_vars = 7292 × 1999 
    obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'leiden', 'velocity_clusters'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm', 'velocity_gamma', 'velocity_r2', 'velocity_genes', 'velocity_score', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_alignment_scaling', 'fit_r2'
    uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key', 'velocity_settings', 'velocity_graph', 'velocity_graph_neg', 'leiden', 'umap', 'leiden_colors', 'rank_velocity_genes', 'recover_dynamics'
    obsm: 'X_pca', 'X_umap', 'velocity_umap'
    varm: 'PCs', 'loss'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu', 'velocity', 'variance_velocity', 'fit_t', 'fit_tau', 'fit_tau_', 'velocity_u'
    obsp: 'distances', 'connectivities'

scv.pl.scatter(adata, color='latent_time', fontsize=24, size=100,
               color_map='gnuplot', perc=[2, 98], colorbar=True, rescale_color=[0,1])

top_genes = adata.var_names[adata.var.fit_likelihood.argsort()[::-1]][:300]
scv.pl.heatmap(adata, var_names=top_genes, tkey='latent_time', n_convolve=100, col_color='leiden')

 top_genes
Out[43]: 
Index(['TBXAS1', 'NKG7', 'PRAM1', 'HCK', 'SMCO4', 'CD22', 'SHTN1', 'CD74',
       'HLA-DQA1', 'HLA-DRA',
       ...
       'GSPT1', 'GALNS', 'NDUFAB1', 'NDUFV2-AS1', 'DHX40', 'PPM1D', 'BCAS3',
       'METTL2A', 'TANC2', 'DDX42'],
      dtype='object', length=300)

scv.pl.scatter(adata, basis=top_genes[:10], legend_loc='none',
               size=80, frameon=False, ncols=5, fontsize=20,color='leiden')

scv.pl.scatter(adata, x='latent_time', y=top_genes[:4], fontsize=16, size=100,
               n_convolve=None, frameon=False, legend_loc='none',color='leiden')

最后编辑于：2020.04.02 23:47:18

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 206,723评论 6赞 481
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 88,485评论 2赞 382
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 152,998评论 0赞 344
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 55,323评论 1赞 279
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 64,355评论 5赞 374
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 49,079评论 1赞 285
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,389评论 3赞 400
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 37,019评论 0赞 259
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 43,519评论 1赞 300
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,971评论 2赞 325
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 38,100评论 1赞 333
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,738评论 4赞 324
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,293评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 30,289评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,517评论 1赞 262
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,547评论 2赞 354
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,834评论 2赞 345