Pacbio 三代全长转录组IsoSeq3数据分析

pacbio三代全长转录组数据分析isoseq3
1 参考官方文档及其他教程，包括原理、流程等

https://github.com/PacificBiosciences/IsoSeq/blob/master/README.md  #isoseq3 
https://github.com/PacificBiosciences/pbbioconda  #PacBio Secondary Analysis Tools on Bioconda
https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md #见下图，Iso-Seq clustering 即无barcode无umi的分析流程
https://databeauty.com/blog/tutorial/2020/12/08/PacBio-Iso-Seq-Data-Analysis.html

image.png

2 软件安装
主要参考pbbioconda
pbbioconda提示需要安装python 2.7的环境；用conda新建一个python2.7的环境，这个在这就不说明了，有很多教程

image.png

在pbbioconda主页中，查看可用的分析软件及详细信息，点击对应的软件即可跳转到该软件的主页，点击Release Notes可查看软件版本更新等；需要安装的有：isoseq3 pbccs lima pbmm2

image.png

以pbccs为例，可点击Release Notes查看更新版本等相信信息，也可用conda search 查看可用的版本

conda search pbccs # conda search 查看可安装版本，search 结果如下，相应的也可在
# Loading channels: done
# Name                       Version           Build  Channel
pbccs                          3.1.0               0  bioconda
pbccs                          3.1.0               1  bioconda
pbccs                          3.1.0               2  bioconda
pbccs                          3.1.0               3  bioconda
pbccs                          3.3.0               0  bioconda
pbccs                          3.4.0               0  bioconda
pbccs                          3.4.1               0  bioconda
pbccs                          4.0.0               0  bioconda
pbccs                          4.1.0               0  bioconda
pbccs                          4.2.0               0  bioconda
pbccs                          4.2.0               1  bioconda
pbccs                          5.0.0               0  bioconda
pbccs                          6.0.0               0  bioconda
pbccs                          6.0.0               1  bioconda
pbccs                          6.0.0      h9ee0642_2  bioconda
pbccs                          6.2.0      h9ee0642_0  bioconda
pbccs                          6.3.0      h9ee0642_0  bioconda
pbccs                          6.4.0      h9ee0642_0  bioconda ##可见最新版本为6.4.0
conda install -c bioconda pbccs=6.4.0 ##安装
ccs ##安装成功，键入ccs 出现以下提示信息
ccs - Generate circular consensus sequences (ccs) from subreads.

Usage:
  ccs [options] <IN.subreads.bam|xml> <OUT.ccs.bam|fastq.gz|xml>

  IN.subreads.bam|xml       FILE   Subreads (.subreads.bam or .subreadset.xml).
  OUT.ccs.bam|fastq.gz|xml  FILE   Consensus reads (.bam, .fastq.gz, or
                                   .consensusreadset.xml).


Input Filter Options:
  --min-passes              INT    Minimum number of full-length subreads
                                   required to generate CCS for a ZMW. [3]
  --min-snr                 FLOAT  Minimum SNR of subreads to use for
                                   generating CCS [2.5]
  --top-passes              INT    Pick at maximum the top N passes for each
                                   ZMW. [60]

Draft Filter Options:
  --min-length              INT    Minimum draft length before polishing. [10]
  --max-length              INT    Maximum draft length before polishing.
                                   [50000]

Chunking Options:
  --chunk                   STR    Operate on a single chunk. Format i/N, where
                                   i in [1,N]. Examples: 3/24 or 9/9
  --max-chunks                     Determine maximum number of chunks.

Model Override Options:
  --model-path              STR    Path to a chemistry model file or directory
                                   containing model files.
  --model-spec              STR    Name of chemistry or model to use,
                                   overriding default selection.

Processing Options:
  --by-strand                      Generate a consensus for each strand.
  --hd-finder                      Enable heteroduplex finder and splitting
  --skip-polish                    Only output the initial draft template
                                   (faster, less accurate).
  --all                            Emit all ZMWs.
  --subread-fallback               Emit a representative subread, instead of
                                   the draft consensus, if polishing failed.
  --all-kinetics                   Calculate mean pulse widths (PW) and
                                   interpulse durations (IPD) for every ZMW.
  --hifi-kinetics                  Calculate mean pulse widths (PW) and
                                   interpulse durations (IPD) for every HiFi
                                   read.

Output Filter Options:
  --min-rq                  FLOAT  Minimum predicted accuracy in [0, 1]. [0.99]

Output Files Options:
  --report-file             FILE   Where to write the results report.
  --report-json             FILE   Where to write the results report as json.
  --metrics-json            FILE   Where to write the zmw metrics as json.
  --suppress-reports               Do not generate report or metric files per
                                   default, only those requested.

  -h,--help                        Show this help and exit.
  --version                        Show application version and exit.
  -j,--num-threads          INT    Number of threads to use, 0 means
                                   autodetection. [0]
  --log-level               STR    Set log level. Valid choices: (TRACE, DEBUG,
                                   INFO, WARN, FATAL). [WARN]
  --log-file                FILE   Log to a file, instead of stderr.

3 分析流程
主要参考isoseq-clustering
3.1 Input 即下机数据的subreads

image.png

3.2 Circular Consensus Sequence calling
用ccs 从subreads获取ccs

ccs movieX.subreads.bam movieX.ccs.bam --min-rq 0.9 
# movieX.subreads.bam 为输入的subreads bam文件
# movieX.ccs.bam 为输出的ccs bam文件
# --min-rq  Minimum predicted accuracy in [0, 1]   0-1 
#还要另外一个参数 --min-passes 需要注意，即最少full passes数的reads默认值为3；如果有很长的转录本测序完成都可能达不到full pass ，可以改成1 --min-passes 1 ；我自己试了一下，1跟3好像没什么区别 
#另外一个重要参数为 --skip-polish  Only output the initial draft template (faster, less accurate) ccs中默认进行polish，添加这个参数则不进行polish，运算快一些但精确度下降；如果已经进行了polish则后面不需要再进行polish

3.3 Primer removal and demultiplexing
用limma去除引物序列
测序公司交付的数据中会有一个以.primer.fasta结尾的文件，是5' 3'的引物序列;去除引物序列前需要查看这个文件里面有多少对或者多少条引物序列

lima movieX.ccs.bam barcoded_primers.fasta movieX.fl.bam --isoseq --peek-guess 
# movieX.ccs.bam为输入的ccs bam文件
# barcoded_primers.fasta 为测序下机提供的.premer.fasta结尾 提供primer的文件
# movieX.fl.bam为输出文件
# --iso-seq  Activate specialized IsoSeq mode 用这个参数激活isoseq模式
# --peek-guess  如果在.primer.fasta文件中不止一对或多个5' 和3' 的引物序列，则需要用--peek-guess；如果只有一对5' 和3' 的引物序列则不需要添加这个参数

输出文件将会加上成对的引物名称，如下：

[movie].fl.NEB_5p--NEB_Clontech_3p.bam

3.4 Remove PolyA Tail and Artificial Concatemers 去除引物序列后，需要用isoseq3 refine 去polyA尾和人工引物的多聚体

isoseq3 refine --require-polya movieX.NEB_5p--NEB_Clontech_3p.fl.bam primers.fasta movieX.flnc.bam
#movieX.NEB_5p--NEB_Clontech_3p.fl.bam 为输入文件
# movieX.flnc.bam 为输出文件
#primers.fasta 
#--require-polya  # If your sample has poly(A) tails, use --require-polya. This filters for FL reads that have a poly(A) tail with at least 20 base pairs (--min-polya-length) and removes identified tail

得到输出文件：

<movie>.flnc.bam
<movie>.flnc.transcriptset.xml

3.5 Clustering

isoseq3 cluster [movie].flnc.bam [movie].polished.bam --verbose --use-qvs
# [movie].flnc.bam 输入文件夹
# [movie].flnc.bam 输出文件
# --verbose Use verbose output 
# --use-qvs  Use CCS QVs, sets --poa-cov 100

输出结果：

<prefix>.bam
<prefix>.hq.fasta.gz with predicted accuracy ≥ 0.99
<prefix>.lq.fasta.gz with predicted accuracy < 0.99
<prefix>.bam.pbi
<prefix>.transcriptset.xml

3.6 用pbmm2 align到基因组
pbmm2 index 构建index
pbmm2 align 进行比对

pbmm2
pbmm2 - minimap2 with native PacBio BAM support

Usage:
  pbmm2 <tool>

  -h,--help    Show this help and exit.
  --version    Show application version and exit.

Tools:
  index      Index reference and store as .mmi file
  align      Align PacBio reads to reference sequences

Examples:
  pbmm2 index ref.referenceset.xml ref.mmi
  pbmm2 align ref.referenceset.xml movie.subreadset.xml ref.movie.alignmentset.xml

Typical workflows:
  A. Generate index file for reference and reuse it to align reads
     $ pbmm2 index ref.fasta ref.mmi
     $ pbmm2 align ref.mmi movie.subreads.bam ref.movie.bam

  B. Align reads and sort on-the-fly, with 4 alignment and 2 sort threads
     $ pbmm2 align ref.fasta movie.subreads.bam ref.movie.bam --sort -j 4 -J 2

  C. Align reads, sort on-the-fly, and create PBI
     $ pbmm2 align ref.fasta movie.subreadset.xml ref.movie.alignmentset.xml --sort

  D. Omit output file and stream BAM output to stdout
     $ pbmm2 align hg38.mmi movie1.subreadset.xml | samtools sort > hg38.movie1.sorted.bam

  E. Align CCS fastq input and sort on-the-fly
     $ pbmm2 align ref.fasta movie.Q20.fastq ref.movie.bam --preset CCS --sort --rg '@RG\tID:myid\tSM:mysample'

Copyright (C) 2004-2022     Pacific Biosciences of California, Inc.
This program comes with ABSOLUTELY NO WARRANTY; it is intended for
Research Use Only and not for use in diagnostic procedures.