【单细胞组学】CITE-seq-Count:分析CITEseq测序数据

导言

昨天我们介绍了CITE-seq的测序原理和简单应用。目前很多国内外公司都能做CITE-seq测序,至少我们知道10x Genomics是可以做的。有了测序数据(reads文件),接下来我们看看怎么得到蛋白的表达数据。

CITE-seq的研发者创建的网站上(https://cite-seq.com/computational-tools/)给出了推荐使用的软件。

对于read-level数据可以使用CITE-seq-Count来获得count 矩阵

CITE-seq-Count的安装和使用

CITE-seq-Count是用Python搭建的工具,所以可以用pip安装:

pip install CITE-seq-Count==1.4.3

How to use it

CITE-seq-Count -R1 TAGS_R1.fastq.gz -R2 TAGS_R2.fastq.gz -t TAG_LIST.csv -cbf X1 -cbl X2 -umif Y1 -umil Y2 -cells EXPECTED_CELLS -o OUTFOLDER

这条命令将对比对到抗体DNA标签序列的reads和UMIs进行计数。

下图解释了其中的输入文件TAGS_R1.fastq.gz和TAGS_R2.fastq.gz预期的结构:

其实官方文档已经写的很好了,结构非常清晰,而且英文也更容易帮助我们理解其中一些术语,所以下面的内容我基本都不做翻译。这个工具使用起来很简单,输出结果也完全可以和Seurat无缝连接。

软件参数:

INPUT

  • [Required] Read1 fastq file location in fastq.gz format. Read 1 typically contains Cell barcode and UMI.

-R1 READ1_PATH.fastq.gz

--read1 READ1_PATH.fastq.gz

  • [Required] Read2 fastq file location in fastq.gz. Read 2 typically contains the Antibody barcode.

-R2 READ2_PATH.fastq.gz

--read2 READ2_PATH.fastq.gz

  • [Required] The path to the csv file containing the antibody barcodes as well as their respective names. You can run tags of different length together.

-t tags.csv, --tags tags.csv

Antibody barcodes structure:

ATGCGA,First_tag_name
GTCATG,Second_tag_name
GCTAGTCGTACGA,Third_tag_nameGCTAGGTGTCGTA,Forth_tag_name

IMPORTANT: You need to provide only the variable region of the TAG in the tags.csv. Please refer to the following examples.

  • CASE1: Legacy barcodes. If you are using barcoes that have a T, C or G plus a polyA tail at the end, the tags.csv file should not contain those additions. Expected barcode:
GCTAGTCGTACGA T AAAAAAAAAA
GCTAGTCGTACGA C AAAAAAAAAA
GCTAGTCGTACGA G AAAAAAAAAA
GCTGTCAGCATAC T AAAAAAAAAA
GCTGTCAGCATAC C AAAAAAAAAAGCTGTCAGCATAC G AAAAAAAAAA

The tags.csv should only contain the part before the T

GCTAGTCGTACGA,tag1GCTGTCAGCATAC,tag2

  • CASE2: Constant sequences. If you are using barcoes that have a constant sequence at the end or at the start, the tags.csv file should only contain the variable part. You should also use the -trim``--start-trim option to tell CITE-seq-Count where the variable part of the barcode starts Expected barcode:
CGTAGTCGTAGCTA GCTAGTCGTACGA GCTAGCTGACTCGTAGTCGTAGCTA AACGTAGCTATGT 
GCTAGCTGACTCGTAGTCGTAGCTA GCTAGCATATCAG GCTAGCTGACT

The tags.csv should only contain the variable parts and use -trim 14 to trim the first 14 bases.

GCTAGTCGTACGA,tag1
AACGTAGCTATGT,tag2GCTAGCATATCAG,tag3

BARCODING

Positions of the cellular and UMI barcodes.

  • [Required] First nucleotide of cell barcode in read 1. For Drop-seq and 10x Genomics this is typically 1.

-cbf CB_FIRST, --cell_barcode_first_base CB_FIRST

  • [Required] Last nucleotide of the cell barcode in read 1. For 10x Genomics this is typically 16. For Drop-seq this depends on the bead configuration, typically 12.

-cbl CB_LAST, --cell_barcode_last_base CB_LAST

  • [Required] First nucleotide of the UMI in read 1. For 10x Genomics this is typically 17. For Drop-seq this is typically 13.

-umif UMI_FIRST, --umi_first_base UMI_FIRST

  • [Required] Last nucleotide of the UMI in read 1. For 10x Genomics this is typically 26. For Drop-seq this is typically 20.

-umil UMI_LAST, --umi_last_base UMI_LASTExample: Barcodes from 1 to 16 and UMI from 17 to 26, then this is the input you need:

-cbf 1 -cbl 16 -umif 17 -umil 26

  • [Optional] How many errors are allowed between two cell barcodes to collapse them onto one cell.

--bc_collapsing_dist N_ERRORS, default 1

  • [Optional] How many errors are allowed between two umi within the same cell and TAG to collapse.

--umi_collapsing_dist N_ERRORS, default 2

  • [Optional] Deactivate UMI correction.

--no_umi_correction

Cells

You have to choose either the number of cells you expect or give it a list of cell barcodes to retrieve.

  • [Required] How many cells you expect in your run.
  • [Optional] If a whitelist is provided.

-cells EXPECTED_CELLS, --expected_cells EXPECTED_CELLS

  • [Optional] Whitelist of cell barcodes provided as a csv file. CITE-seq-Count will search for those barcodes in the data and correct other barcodes based on this list. Will force the output to provide all the barcodes from the whitelist. Please see the guidelines for information regarding specific chemistries.

-wl WHITELIST, --whitelist WHITELISTExample:

ATGCTAGTGCTAGCTAGTCAGGATCGACTGCTAACG

FILTERING

Filtering for structure of the antibody barcode as well as maximum errors.

  • [OPTIONAL] Maximum Levenshtein distance allowed. This allows to catch antibody barcodes that might have --max-error errors compared to the real barcodes. (was -hd in previous versions)

--max-error MAX_ERROR, default 3Example:

If we have this kind of antibody barcode:

ATGCCAGThe script will be looking for ATGCCAG in R2

A MAX_ERROR of 1 will allow barcodes such as ATGTCAG, having one mismatch to be counted as valid.

There is a sanity check when for the MAX_ERROR value chosen to be sure you are not allowing too many mismatches and confuse your antibody barcodes. Mismatches on cell or UMI barcodes are discarded.

  • [Optional] How many bases should we trim before starting to map. See CASE2 of special examples in the

-trim N_BASES, --start-trim N_BASES, default 0

  • [OPTIONAL] Activate sliding window alignement. Use this when you have a protocol that has a variable sequence before the inserted TAG.

--sliding-window, default FalseExample:

The TAG: ATGCTAGCT with a variable prefix: TTCAATTTCA R2 reads:

TTCA ATGCTAGCTAAAAAAAAAAAAAAAAA
TTCAA ATGCTAGCTAAAAAAAAAAAAAAAA
TTCAAT ATGCTAGCTAAAAAAAAAAAAAAA
TTCAATT ATGCTAGCTAAAAAAAAAAAAAA
TTCAATTT ATGCTAGCTAAAAAAAAAAAAATTCAATTTC ATGCTAGCTAAAAAAAAAAAA

OUTPUT

  • [Required] Path to the result folder that will contain both read and umi mtx results as well as a run_report.yaml and potential unmapped tags.

-o OUTFOLDER, --output OUTFOLDER, default Results

  • [Optional] Will output the dense umi count matrix in a tsv format in addition to the sparse outputs.

--dense

OPTIONAL

  • [Optional] Select first N reads to run on instead of all. This is usefull when trying to test out your parameters before running the whole dataset.

-n FIRST_N, --first_n FIRST_N

  • [Optional] How many threads/cores should be used by CITE-seq-Count.

-T N_THREADS, --threads N_THREADS, default Number of available cores

  • [Optional] Output file for unmapped tags. Can be useful to troubelshoot an experiment with low mapping percentage or high "uncorrected cells".

-u OUTFILE, --unmapped-tags OUTFILE, default unmapped.csv

  • [Optional] How many unmapped tags should be written to file

-ut N_UNMAPPED, --unknown-top-tags N_UNMAPPED, default 50

  • [Optional] Print more information about the mapping process. Only use it to find issues. Slows down the processing by a lot.

--debug

软件输出结果

Mtx format

The mtx, matrix market, format is a sparse format for matrices. It only stores non zero values and is becoming popular in single-cell softwares.

The main advantage is that it requires less space than a dense matrix and that you can easily add different feature names within the same object.

For CITE-seq-Count, the output looks like this:

OUTFOLDER/
-- umi_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- read_count/
-- -- matrix.mtx.gz
-- -- features.tsv.gz
-- -- barcodes.tsv.gz
-- unmapped.csv
-- run_report.yaml

File descriptions

  • features.tsv.gz contains the feature names, in this context our tags.
  • barcodes.tsv.gz contains the cell barcodes.
  • matrix.mtx.gz contains the actual values. read_count and umi_count contain respectively the read counts and the collapsed umi counts. For analysis you should use the umi data. The read_count can be used to check if you have an overamplification or oversequencing issue with your protocol.
  • unmapped.csv contains the top N tags that haven't been mapped.
  • run_report.yaml contains the parameters used for the run as well as some statistics. here is an example:
Date: 2019-10-01Running time: 13.86 seconds
CITE-seq-Count Version: 1.4.3
Reads processed: 1000000
Percentage mapped: 33
Percentage unmapped: 67
Uncorrected cells: 0Correction:
    Cell barcodes collapsing threshold: 1
    Cell barcodes corrected: 57
    UMI collapsing threshold: 2
    UMIs corrected: 329
Run parameters:
    Read1_filename: fastq/test_R1.fastq.gz,fastq/test2_R1.fastq.gz
    Read2_filename: fastq/test_R2.fastq.gz,fastq/test2_R2.fastq.gz
    Cell barcode:
        First position: 1
        Last position: 16
    UMI barcode:
        First position: 17
        Last position: 26
    Expected cells: 100
    Tags max errors: 1
    Start trim: 0

Packages to read MTX

R:

I recommend using Seurat and their Read10x function to read the results.

With Seurat V3:

Read10x('OUTFOLDER/umi_count/', gene.column=1)

With Matrix:

library(Matrix)
matrix_dir = "/path_to_your_directory/out_cite_seq_count/umi_count/"barcode.path <- paste0(matrix_dir, "barcodes.tsv.gz")
features.path <- paste0(matrix_dir, "features.tsv.gz")
matrix.path <- paste0(matrix_dir, "matrix.mtx.gz")
mat <- readMM(file = matrix.path)
feature.names = read.delim(features.path, header = FALSE, stringsAsFactors = FALSE)
barcode.names = read.delim(barcode.path, header = FALSE, stringsAsFactors = FALSE)
colnames(mat) = barcode.names$V1
rownames(mat) = feature.names$V1

Python:

I recommend using scanpy and their read_mtx function to read the results.

Example:

import scanpy
import pandas as pd
import os
path = 'umi_cell_corrected'
data = scanpy.read_mtx(os.path.join(path,'umi_count/matrix.mtx.gz'))
data = data.T
features = pd.read_csv(os.path.join(path, 'umi_count/features.tsv.gz'), header=None)
barcodes = pd.read_csv(os.path.join(path, 'umi_count/barcodes.tsv.gz'), header=None)
data.var_names = features[0]
data.obs_names = barcodes[0]

Reference

[1]https://hoohm.github.io/CITE-seq-Count/

欢迎关注同名公众号

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 206,214评论 6 481
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 88,307评论 2 382
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 152,543评论 0 341
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 55,221评论 1 279
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 64,224评论 5 371
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,007评论 1 284
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,313评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,956评论 0 259
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 43,441评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,925评论 2 323
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,018评论 1 333
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,685评论 4 322
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,234评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,240评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,464评论 1 261
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,467评论 2 352
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,762评论 2 345

推荐阅读更多精彩内容