pacbio三代全长转录组数据分析isoseq3
1 参考官方文档及其他教程,包括原理、流程等
https://github.com/PacificBiosciences/IsoSeq/blob/master/README.md #isoseq3
https://github.com/PacificBiosciences/pbbioconda #PacBio Secondary Analysis Tools on Bioconda
https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md #见下图,Iso-Seq clustering 即无barcode无umi的分析流程
https://databeauty.com/blog/tutorial/2020/12/08/PacBio-Iso-Seq-Data-Analysis.html
2 软件安装
主要参考pbbioconda
pbbioconda提示需要安装python 2.7的环境;用conda新建一个python2.7的环境,这个在这就不说明了,有很多教程
在pbbioconda主页中,查看可用的分析软件及详细信息,点击对应的软件即可跳转到该软件的主页,点击Release Notes可查看软件版本更新等;需要安装的有:isoseq3 pbccs lima pbmm2
以pbccs为例,可点击Release Notes查看更新版本等相信信息,也可用conda search 查看可用的版本
conda search pbccs # conda search 查看可安装版本,search 结果如下,相应的也可在
# Loading channels: done
# Name Version Build Channel
pbccs 3.1.0 0 bioconda
pbccs 3.1.0 1 bioconda
pbccs 3.1.0 2 bioconda
pbccs 3.1.0 3 bioconda
pbccs 3.3.0 0 bioconda
pbccs 3.4.0 0 bioconda
pbccs 3.4.1 0 bioconda
pbccs 4.0.0 0 bioconda
pbccs 4.1.0 0 bioconda
pbccs 4.2.0 0 bioconda
pbccs 4.2.0 1 bioconda
pbccs 5.0.0 0 bioconda
pbccs 6.0.0 0 bioconda
pbccs 6.0.0 1 bioconda
pbccs 6.0.0 h9ee0642_2 bioconda
pbccs 6.2.0 h9ee0642_0 bioconda
pbccs 6.3.0 h9ee0642_0 bioconda
pbccs 6.4.0 h9ee0642_0 bioconda ##可见最新版本为6.4.0
conda install -c bioconda pbccs=6.4.0 ##安装
ccs ##安装成功,键入ccs 出现以下提示信息
ccs - Generate circular consensus sequences (ccs) from subreads.
Usage:
ccs [options] <IN.subreads.bam|xml> <OUT.ccs.bam|fastq.gz|xml>
IN.subreads.bam|xml FILE Subreads (.subreads.bam or .subreadset.xml).
OUT.ccs.bam|fastq.gz|xml FILE Consensus reads (.bam, .fastq.gz, or
.consensusreadset.xml).
Input Filter Options:
--min-passes INT Minimum number of full-length subreads
required to generate CCS for a ZMW. [3]
--min-snr FLOAT Minimum SNR of subreads to use for
generating CCS [2.5]
--top-passes INT Pick at maximum the top N passes for each
ZMW. [60]
Draft Filter Options:
--min-length INT Minimum draft length before polishing. [10]
--max-length INT Maximum draft length before polishing.
[50000]
Chunking Options:
--chunk STR Operate on a single chunk. Format i/N, where
i in [1,N]. Examples: 3/24 or 9/9
--max-chunks Determine maximum number of chunks.
Model Override Options:
--model-path STR Path to a chemistry model file or directory
containing model files.
--model-spec STR Name of chemistry or model to use,
overriding default selection.
Processing Options:
--by-strand Generate a consensus for each strand.
--hd-finder Enable heteroduplex finder and splitting
--skip-polish Only output the initial draft template
(faster, less accurate).
--all Emit all ZMWs.
--subread-fallback Emit a representative subread, instead of
the draft consensus, if polishing failed.
--all-kinetics Calculate mean pulse widths (PW) and
interpulse durations (IPD) for every ZMW.
--hifi-kinetics Calculate mean pulse widths (PW) and
interpulse durations (IPD) for every HiFi
read.
Output Filter Options:
--min-rq FLOAT Minimum predicted accuracy in [0, 1]. [0.99]
Output Files Options:
--report-file FILE Where to write the results report.
--report-json FILE Where to write the results report as json.
--metrics-json FILE Where to write the zmw metrics as json.
--suppress-reports Do not generate report or metric files per
default, only those requested.
-h,--help Show this help and exit.
--version Show application version and exit.
-j,--num-threads INT Number of threads to use, 0 means
autodetection. [0]
--log-level STR Set log level. Valid choices: (TRACE, DEBUG,
INFO, WARN, FATAL). [WARN]
--log-file FILE Log to a file, instead of stderr.
3 分析流程
主要参考isoseq-clustering
3.1 Input 即下机数据的subreads
3.2 Circular Consensus Sequence calling
用ccs 从subreads获取ccs
ccs movieX.subreads.bam movieX.ccs.bam --min-rq 0.9
# movieX.subreads.bam 为输入的subreads bam文件
# movieX.ccs.bam 为输出的ccs bam文件
# --min-rq Minimum predicted accuracy in [0, 1] 0-1
#还要另外一个参数 --min-passes 需要注意,即最少full passes数的reads默认值为3;如果有很长的转录本测序完成都可能达不到full pass ,可以改成1 --min-passes 1 ;我自己试了一下,1跟3好像没什么区别
#另外一个重要参数为 --skip-polish Only output the initial draft template (faster, less accurate) ccs中默认进行polish,添加这个参数则不进行polish,运算快一些但精确度下降;如果已经进行了polish则后面不需要再进行polish
3.3 Primer removal and demultiplexing
用limma去除引物序列
测序公司交付的数据中会有一个以.primer.fasta结尾的文件,是5' 3'的引物序列;去除引物序列前需要查看这个文件里面有多少对或者多少条引物序列
lima movieX.ccs.bam barcoded_primers.fasta movieX.fl.bam --isoseq --peek-guess
# movieX.ccs.bam为输入的ccs bam文件
# barcoded_primers.fasta 为测序下机提供的.premer.fasta结尾 提供primer的文件
# movieX.fl.bam为输出文件
# --iso-seq Activate specialized IsoSeq mode 用这个参数激活isoseq模式
# --peek-guess 如果在.primer.fasta文件中不止一对或多个5' 和3' 的引物序列,则需要用--peek-guess;如果只有一对5' 和3' 的引物序列则不需要添加这个参数
输出文件将会加上成对的引物名称,如下:
[movie].fl.NEB_5p--NEB_Clontech_3p.bam
3.4 Remove PolyA Tail and Artificial Concatemers 去除引物序列后,需要用isoseq3 refine 去polyA尾和人工引物的多聚体
isoseq3 refine --require-polya movieX.NEB_5p--NEB_Clontech_3p.fl.bam primers.fasta movieX.flnc.bam
#movieX.NEB_5p--NEB_Clontech_3p.fl.bam 为输入文件
# movieX.flnc.bam 为输出文件
#primers.fasta
#--require-polya # If your sample has poly(A) tails, use --require-polya. This filters for FL reads that have a poly(A) tail with at least 20 base pairs (--min-polya-length) and removes identified tail
得到输出文件:
<movie>.flnc.bam
<movie>.flnc.transcriptset.xml
3.5 Clustering
isoseq3 cluster [movie].flnc.bam [movie].polished.bam --verbose --use-qvs
# [movie].flnc.bam 输入文件夹
# [movie].flnc.bam 输出文件
# --verbose Use verbose output
# --use-qvs Use CCS QVs, sets --poa-cov 100
输出结果:
<prefix>.bam
<prefix>.hq.fasta.gz with predicted accuracy ≥ 0.99
<prefix>.lq.fasta.gz with predicted accuracy < 0.99
<prefix>.bam.pbi
<prefix>.transcriptset.xml
3.6 用pbmm2 align到基因组
pbmm2 index 构建index
pbmm2 align 进行比对
pbmm2
pbmm2 - minimap2 with native PacBio BAM support
Usage:
pbmm2 <tool>
-h,--help Show this help and exit.
--version Show application version and exit.
Tools:
index Index reference and store as .mmi file
align Align PacBio reads to reference sequences
Examples:
pbmm2 index ref.referenceset.xml ref.mmi
pbmm2 align ref.referenceset.xml movie.subreadset.xml ref.movie.alignmentset.xml
Typical workflows:
A. Generate index file for reference and reuse it to align reads
$ pbmm2 index ref.fasta ref.mmi
$ pbmm2 align ref.mmi movie.subreads.bam ref.movie.bam
B. Align reads and sort on-the-fly, with 4 alignment and 2 sort threads
$ pbmm2 align ref.fasta movie.subreads.bam ref.movie.bam --sort -j 4 -J 2
C. Align reads, sort on-the-fly, and create PBI
$ pbmm2 align ref.fasta movie.subreadset.xml ref.movie.alignmentset.xml --sort
D. Omit output file and stream BAM output to stdout
$ pbmm2 align hg38.mmi movie1.subreadset.xml | samtools sort > hg38.movie1.sorted.bam
E. Align CCS fastq input and sort on-the-fly
$ pbmm2 align ref.fasta movie.Q20.fastq ref.movie.bam --preset CCS --sort --rg '@RG\tID:myid\tSM:mysample'
Copyright (C) 2004-2022 Pacific Biosciences of California, Inc.
This program comes with ABSOLUTELY NO WARRANTY; it is intended for
Research Use Only and not for use in diagnostic procedures.