单细胞数据质控-双细胞预测-scrublet使用教程

作者：ahworld
链接：单细胞数据质控-双细胞预测-scrublet使用教程
来源：微信公众号-seqyuan
著作权归作者所有，任何形式的转载都请联系作者。

在分析scRNA-seq数据之前，我们必须确保所有细胞barcode均与活细胞相对应。通常基于三个QC协变量执行细胞QC（Quality control）：

每个barcode的数量
每个barcode对应的基因数量
每个barcode的数量中线粒体基因的占比

通过这些QC协变量的分布图，可以通过阈值过滤掉离群峰。

这些异常的barcodes对应着：

死细胞
细胞膜破损的细胞
双细胞(doublets)

例如，barcodes计数深度低，检测到的基因很少且线粒体计数高，这表明细胞的细胞质mRNA已通过破膜渗出，因此，仅位于线粒体中的mRNA仍然在细胞内。相反，具有非预期高计数和检测到大量基因的细胞可能代表双细胞。

检测scRNA-seq中双细胞的分析鉴定工具总结了以下几种：

这些双细胞的分析鉴定工具在2019年发表的《单细胞数据分析最佳实践》中也有推荐(Luecken M D et al, 2019)

本期将对scrublet的使用做一个详细介绍

scrublet的使用

scrublet文献:Wolock S L, Lopez R, Klein A M. Scrublet: computational identification of cell doublets in single-cell transcriptomic data[J]. Cell systems, 2019, 8(4): 281-291. e9.
scrublet教程参考来源链接

安装

scrublet为python语言编写的包，用以下pip命令安装就可以。

pip install scrublet

使用注意事项

处理来自多个样本的数据时，请分别对每个样本运行Scrublet。因为Scrublet旨在检测由两个细胞的随机共封装形成的technical doublets，所以在merged数据集上可能会表现不佳，因为细胞类型比例不代表任何单个样品；
检查doublet score阈值是否合理，并在必要时进行手动调整。并不是所有情况向下doublet score的直方分布图都是呈现标准的双峰；
UMAP或t-SNE可视化的结果中，预测的双细胞应该大体上共定位（可能在多个细胞群中）。如果不是，则可能需要调整doublet score阈值，或更改预处理参数以更好地解析数据中存在的细胞状态。

准备工作

数据准备

下载来自10X Genomics8k的PBMC数据集并解压。

wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc8k/pbmc8k_filtered_gene_bc_matrices.tar.gz
tar xfz pbmc8k_filtered_gene_bc_matrices.tar.gz

numpy兼容性报错修复

高版本numpy带来的cannot import name '_validate_lenghs'报错修复方案

scrublet包的开发依赖的是比较低版本的numpy，会用到arraypad.py中的_validate_lenghs函数，这个函数在比较高的numpy版本中已经弃用，如果安装的是高版本numpy，可能在运行scrublet时会有报错导致中断：cannot import name '_validate_lenghs'。

考虑到一般其他软件包依赖高版本numpy的情况比较多，不想再降低numpy的版本，所以让scrublet能够运行下去的修复方案为如下：

打开终端，进入Python环境，输入以下代码，查看Python安装位置。

import sys
print(sys.path)

找到arraypad.py的位置 ~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraypad.py，打开文件，在文件最后添加以下代码，保存退出，问题解决。

def _normalize_shape(ndarray, shape, cast_to_int=True):
    """
    Private function which does some checks and normalizes the possibly
    much simpler representations of ‘pad_width‘, ‘stat_length‘,
    ‘constant_values‘, ‘end_values‘.

    Parameters
    ----------
    narray : ndarray
        Input ndarray
    shape : {sequence, array_like, float, int}, optional
        The width of padding (pad_width), the number of elements on the
        edge of the narray used for statistics (stat_length), the constant
        value(s) to use when filling padded regions (constant_values), or the
        endpoint target(s) for linear ramps (end_values).
        ((before_1, after_1), ... (before_N, after_N)) unique number of
        elements for each axis where `N` is rank of `narray`.
        ((before, after),) yields same before and after constants for each
        axis.
        (constant,) or val is a shortcut for before = after = constant for
        all axes.
    cast_to_int : bool, optional
        Controls if values in ``shape`` will be rounded and cast to int
        before being returned.

    Returns
    -------
    normalized_shape : tuple of tuples
        val                               => ((val, val), (val, val), ...)
        [[val1, val2], [val3, val4], ...] => ((val1, val2), (val3, val4), ...)
        ((val1, val2), (val3, val4), ...) => no change
        [[val1, val2], ]                  => ((val1, val2), (val1, val2), ...)
        ((val1, val2), )                  => ((val1, val2), (val1, val2), ...)
        [[val ,     ], ]                  => ((val, val), (val, val), ...)
        ((val ,     ), )                  => ((val, val), (val, val), ...)

    """
    ndims = ndarray.ndim

    # Shortcut shape=None
    if shape is None:
        return ((None, None), ) * ndims

    # Convert any input `info` to a NumPy array
    shape_arr = np.asarray(shape)

    try:
        shape_arr = np.broadcast_to(shape_arr, (ndims, 2))
    except ValueError:
        fmt = "Unable to create correctly shaped tuple from %s"
        raise ValueError(fmt % (shape,))

    # Cast if necessary
    if cast_to_int is True:
        shape_arr = np.round(shape_arr).astype(int)

    # Convert list of lists to tuple of tuples
    return tuple(tuple(axis) for axis in shape_arr.tolist())


def _validate_lengths(narray, number_elements):
    """
    Private function which does some checks and reformats pad_width and
    stat_length using _normalize_shape.

    Parameters
    ----------
    narray : ndarray
        Input ndarray
    number_elements : {sequence, int}, optional
        The width of padding (pad_width) or the number of elements on the edge
        of the narray used for statistics (stat_length).
        ((before_1, after_1), ... (before_N, after_N)) unique number of
        elements for each axis.
        ((before, after),) yields same before and after constants for each
        axis.
        (constant,) or int is a shortcut for before = after = constant for all
        axes.

    Returns
    -------
    _validate_lengths : tuple of tuples
        int                               => ((int, int), (int, int), ...)
        [[int1, int2], [int3, int4], ...] => ((int1, int2), (int3, int4), ...)
        ((int1, int2), (int3, int4), ...) => no change
        [[int1, int2], ]                  => ((int1, int2), (int1, int2), ...)
        ((int1, int2), )                  => ((int1, int2), (int1, int2), ...)
        [[int ,     ], ]                  => ((int, int), (int, int), ...)
        ((int ,     ), )                  => ((int, int), (int, int), ...)

    """
    normshp = _normalize_shape(narray, number_elements)
    for i in normshp:
        chk = [1 if x is None else x for x in i]
        chk = [1 if x >= 0 else -1 for x in chk]
        if (chk[0] < 0) or (chk[1] < 0):
            fmt = "%s cannot contain negative values."
            raise ValueError(fmt % (number_elements,))
    return normshp

scrublet使用教程

我的测试执行环境是MACOS jupyter notebook，以下代码为python包的载入和画图设置：

%matplotlib inline
import scrublet as scr
import scipy.io
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'
plt.rc('font', size=14)
plt.rcParams['pdf.fonttype'] = 42

读入10X的scRNA-seq矩阵，读入raw counts矩阵为scipy sparse矩阵，cells作为行，genes作为列：

input_dir = '/Users/yuanzan/Desktop/doublets/filtered_gene_bc_matrices/GRCh38/'
counts_matrix = scipy.io.mmread(input_dir + '/matrix.mtx').T.tocsc()
genes = np.array(scr.load_genes(input_dir + '/genes.tsv', delimiter='\t', column=1))
out_df = pd.read_csv(input_dir + '/barcodes.tsv', header = None, index_col=None, names=['barcode'])


print('Counts matrix shape: {} rows, {} columns'.format(counts_matrix.shape[0], counts_matrix.shape[1]))
print('Number of genes in gene list: {}'.format(len(genes)))

Counts matrix shape: 8381 rows, 33694 columns
Number of genes in gene list: 33694

初始化Scrublet对象

计算doublet score

运行下面的代码计算doublet score，内部处理过程包括:

Doublet simulation
Normalization, gene filtering, rescaling, PCA
Doublet score calculation
Doublet score threshold detection and doublet calling

doublet_scores, predicted_doublets = scrub.scrub_doublets(min_counts=2, min_cells=3, min_gene_variability_pctl=85, n_prin_comps=30)

绘制doublet score分布直方图

Doublet score分布直方图包括观察到的转录组和模拟的doublet，模拟的doublet直方图通常是双峰的。

下面左图模式对应于由具有相似基因表达的两个细胞产生的"embedded" doublets；
右图模式对应"neotypic" doublets，由具有不同基因表达的细胞（例如，不同类型的细胞）产生，这些会在下游分析中引入更多的假象。Scrublet只能检测"neotypic" doublets。

要call doublets vs. singlets，我们必须设置一个doublet score阈值，理想情况下，阈值应在模拟doublet直方图的两种模式之间设置最小值。scrub_doublets()函数尝试自动识别这一点，在这个测试数据示例中表现比较好。如果自动阈值检测效果不佳，则可以使用call_doublets()函数调整阈值，例如：

scrub.call_doublets(threshold=0.25)

# 画doublet score直方图
scrub.plot_histogram()

降维可视化

降维计算

这个示例采用UMAP降维，还有tSNE可选，作者不推荐用tSNE，因为运行比较慢。

print('Running UMAP...')
scrub.set_embedding('UMAP', scr.get_umap(scrub.manifold_obs_, 10, min_dist=0.3))
print('Done.')

UMAP可视化

scrub.plot_embedding('UMAP', order_points=True)

下面左图黑色的点为预测的doublets。

# doublets占比
print (scrub.detected_doublet_rate_)
# 0.043789523923159525

把doublets预测结果保存到文件，后续用Seurat等软件处理的时候可以导入doublets的预测结果对barcode进行筛选。

out_df['doublet_scores'] = doublet_scores
out_df['predicted_doublets'] = predicted_doublets
out_df.to_csv(input_dir + '/doublet.txt', index=False,header=True)
out_df.head()

barcode	doublet_scores	predicted_doublets
AAACCTGAGCATCATC-1	0.020232985898221900	FALSE
AAACCTGAGCTAACTC-1	0.009746972531259230	FALSE
AAACCTGAGCTAGTGG-1	0.013493253373313300	FALSE
AAACCTGCACATTAGC-1	0.087378640776699	FALSE
AAACCTGCACTGTTAG-1	0.02405046655276650	FALSE
AAACCTGCATAGTAAG-1	0.03969184391224250	FALSE
AAACCTGCATGAACCT-1	0.030082836796977200	FALSE

此次测试的jupyter notebook上传到了我的github-seqyuan，有需要可以下载测试。

最后编辑于：2020.04.14 11:02:38

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 204,053评论 6赞 478
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 85,527评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 150,779评论 0赞 337
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,685评论 1赞 276
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,699评论 5赞 366
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,609评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,989评论 3赞 396
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,654评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,890评论 1赞 298
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,634评论 2赞 321
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,716评论 1赞 330
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,394评论 4赞 319
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,976评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,950评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,191评论 1赞 260
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 44,849评论 2赞 349
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,458评论 2赞 342