Input
[General]
input_fofn=input.fofn
input_type=raw
pa_DBdust_option=true
pa_fasta_filter_option=streamed-median
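input_type: either raw or preads. If preads is specified, the pipeline skips the entire 0-rawreads pre-assembly stage.
pa_fasta_filter_option: defaults to streamed-internal-median. It decides which subread to keep when a single ZMW yields several subreads. "pass": no filtering, keep every subread; "streamed-median": keep the median-length subread; "streamed-internal-median": keep the longest subread when the ZMW has fewer than 3 subreads, otherwise keep the median-length subread.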
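The three filter modes described above can be sketched roughly as follows. This is an illustration only, not FALCON's implementation: the function name `pick_subreads` is hypothetical, the tie-breaking rule for an even number of subreads is an assumption, and the sketch does not try to distinguish "internal" subreads from the first and last ones, since the text above does not say how that is done.

```python
from typing import List


def pick_subreads(lengths: List[int], mode: str) -> List[int]:
    """Return the subread lengths kept for one ZMW under a given filter mode."""
    if not lengths:
        return []
    if mode == "pass":
        # No filtering: every subread is kept.
        return list(lengths)
    ordered = sorted(lengths)
    # Assumption: with an even count, take the upper of the two middle values.
    median = ordered[len(ordered) // 2]
    if mode == "streamed-median":
        return [median]
    if mode == "streamed-internal-median":
        # Fewer than 3 subreads: fall back to the longest one.
        if len(lengths) < 3:
            return [max(lengths)]
        return [median]
    raise ValueError(f"unknown filter mode: {mode}")


zmw = [1200, 9800, 10100, 9900, 3000]  # made-up subread lengths for one ZMW
for mode in ("pass", "streamed-median", "streamed-internal-median"):
    print(mode, pick_subreads(zmw, mode))
```

Keeping one subread per ZMW avoids counting the same molecule several times during overlapping, which is presumably why "pass" needs extra handling in the partitioning step.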
Data Partitioning
# large genomes
pa_DBsplit_option=-x500 -s200
ovlp_DBsplit_option=-x500 -s200
# small genomes (<10Mb)
pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50
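These settings are passed to DBsplit, which splits the data into multiple blocks; all subsequent computation operates on these blocks. -s controls the size of the DB blocks.
If you set pa_fasta_filter_option=pass earlier, add the -a option to pa_DBsplit_option here.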
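To get a feel for how -x and -s interact, the following sketch estimates the number of blocks a dataset would be split into. It assumes -x is a minimum read length and -s is a target block size in megabases; the helper name `estimate_blocks` is hypothetical, and DBsplit's actual partitioning may differ in detail.

```python
import math
from typing import Iterable


def estimate_blocks(read_lengths: Iterable[int], min_len: int, block_mbp: int) -> int:
    """Rough block count: bases in reads >= min_len, divided by the block size."""
    kept_bases = sum(n for n in read_lengths if n >= min_len)
    return max(1, math.ceil(kept_bases / (block_mbp * 1_000_000)))


# Made-up dataset: 30,000 reads of 10 kb plus 1,000 short reads of 400 bp.
reads = [10_000] * 30_000 + [400] * 1_000

print(estimate_blocks(reads, min_len=500, block_mbp=200))  # -x500 -s200
print(estimate_blocks(reads, min_len=500, block_mbp=50))   # -x500 -s50
```

Smaller blocks mean more, lighter jobs, which is why the small-genome settings above use -s50.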
Repeat Masking
pa_HPCTANmask_option=
pa_REPmask_code=0,300;0,300;0,300
Repeat masking occurs in two phases, Tandem and Interspersed. Tandem repeat masking is run with a modified version of daligner called datander and thus uses a similar parameter set. Whatever settings you use for pre-assembly daligner overlapping in the next section (pa_daligner_option) will be used here for tandem repeat masking. You can supply additional arguments for tandem repeat masking that will be passed to HPC.TANmask with the pa_HPCTANmask_option.
The second phase of masking deals with interspersed repeats and can be run in up to 3 iterations specified with the pa_REPmask_code option. Each iteration needs both a group size and a coverage, given as group,coverage pairs separated by semicolons as seen above.
For information and theory on how to set up your rounds of repeat masking, consult this blog post.
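The group,coverage format of pa_REPmask_code can be unpacked with a few lines of Python. This is only a sketch of the format described above; the function name `parse_repmask_code` is hypothetical and is not part of FALCON.

```python
from typing import List, Tuple


def parse_repmask_code(code: str) -> List[Tuple[int, int]]:
    """Split 'group,coverage;group,coverage;...' into (group, coverage) pairs."""
    rounds = []
    for chunk in code.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        group, coverage = (int(x) for x in chunk.split(","))
        rounds.append((group, coverage))
    if len(rounds) > 3:
        raise ValueError("at most 3 repeat-masking iterations can be specified")
    return rounds


for i, (group, coverage) in enumerate(parse_repmask_code("0,300;0,300;0,300"), start=1):
    print(f"iteration {i}: group size {group}, coverage {coverage}")
```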
Pre-assembly
genome_size=1000000000
seed_coverage=30
length_cutoff=-1
pa_HPCdaligner_option=-v -B128 -M24
pa_daligner_option=-e0.8 -l2000 -k18 -h480 -w8 -s100
falcon_sense_option=--output-multi --min-idt 0.70 --min-cov 3 --max-n-read 400
falcon_sense_greedy=False
During pre-assembly, the PacBio subreads are aligned and error correction is performed. The longest subreads are chosen as seed reads, all shorter reads are aligned to them, and consensus sequences are generated from the alignments. These consensus sequences are called pre-assembled reads, or preads, and generally have accuracy greater than 99% or QV20.
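To have the seed-read length cutoff calculated automatically, set length_cutoff=-1 and give genome_size and seed_coverage reasonable values, as in the example above. We generally recommend 20-40x seed coverage.
Alternatively, if you do not know the genome size, are unsure what seed_coverage to use, or simply want to use all reads above a specific length, you can set the limit manually with length_cutoff.
Note that whatever value length_cutoff takes also limits falcon-unzip: any reads shorter than the cutoff will not be used for phasing. For assembly, there is probably no harm in a high length_cutoff unless you expect a specific feature such as micro-chromosomes or short circular plasmids. If you plan to unzip, however, a high cutoff artificially limits your phasing dataset, and you may benefit from a lower length_cutoff. Most of the computation happens in pre-assembly, so if compute time matters to you, raising length_cutoff will improve efficiency, subject to the trade-off described above.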
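The relationship between genome_size, seed_coverage and the seed length cutoff can be sketched as below. The assumption here is that the automatic mode picks the shortest read length such that all reads at least that long add up to genome_size x seed_coverage bases; the helper `auto_length_cutoff` is hypothetical, and raising an error when the data cannot reach the target is a choice made for this sketch, not documented FALCON behaviour.

```python
from typing import Iterable


def auto_length_cutoff(read_lengths: Iterable[int], genome_size: int,
                       seed_coverage: float) -> int:
    """Length of the read at which the longest reads reach the seed coverage."""
    target_bases = genome_size * seed_coverage
    total = 0
    # Walk from the longest read downwards, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    raise ValueError("not enough bases to reach the requested seed coverage")


# Toy dataset for a 10 kb "genome".
reads = [20_000] * 10 + [10_000] * 20 + [5_000] * 40

print(auto_length_cutoff(reads, genome_size=10_000, seed_coverage=30))
print(auto_length_cutoff(reads, genome_size=10_000, seed_coverage=20))
```

Asking for less seed coverage pushes the cutoff up, so fewer and longer reads become seeds; this is the same trade-off between compute time and the phasing dataset discussed above.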
Overlap options for daligner are set with the pa_HPCdaligner_option and pa_daligner_option flags. Previous versions of FALCON had a single parameter. This is now split into two flags: pa_HPCdaligner_option, which affects requested resources, and pa_daligner_option, which affects the overlap search. For pa_HPCdaligner_option, the -v parameter is passed to the LAsort and LAmerge programs, while the -B and -M parameters are passed to the daligner sub-commands.
To understand the theory and how to configure daligner, see this blog post and this command reference guide.
For daligner, in general we recommend the following:
-e: average correlation rate (average sequence identity), 0.70 (low quality data) - 0.80 (high quality data). A higher value will help prevent haplotype collapse.
-l: minimum length of overlap, 1000 (shorter library) - 5000 (longer library)
-k: kmer size, 14 (low quality data) - 18 (high quality data)
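Lower values of -k are more sensitive, at the cost of more disk space, higher memory consumption and slower run time, and work best with lower quality data. Conversely, larger -k values are more specific, use fewer system resources and run faster, but are only suitable for high quality data.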
You can configure basic pre-assembly consensus calling options with the falcon_sense_option flag.
--output-multi: necessary for generating proper fasta headers
--min-idt: minimum alignment identity
--min-cov: minimum coverage necessary
--max-n-read: max number of reads for calling consensus to make the preads
By default, -fo are the parameters passed to LA4Falcon. The option falcon_sense_greedy changes this parameter set to -fog, which essentially attempts to maintain relative information between reads that have been broken due to regions of low quality.
Pread overlapping
ovlp_HPCdaligner_option=-v -M24 -l500
ovlp_daligner_option=-e.96 -s1000 -h60
The second phase, overlapping of the error-corrected reads, works much like the overlapping performed in pre-assembly, except that no repeat masking is performed and no consensus is called. Overlaps are identified and fed into the final assembly. The parameter options work the same way as described above in the pre-assembly section.
Recommendation for preads:
-e: average correlation rate (average sequence identity), 0.93 (inbred) - 0.96 (outbred)
-l: minimum length of overlap, 1800 (poor preassembly, short/low quality library) - 6000 (long, high quality library)
-k: kmer size, 18 (low quality) - 24 (most cases)
Final Assembly
# experiment with "--min-idt" to collapse (98-99) or split haplotypes (up to 99.9) during contig assembly
# if you plan to unzip, collapse first using ~98, lower for very divergent haplotypes
# ignore indels looks at only substitutions in overlaps, allows higher overlap stringency to reduce repeat-induced errors
overlap_filtering_setting = --max-diff 400 --max-cov 400 --min-cov 2 --n-core 24 --min-idt 99.9 --ignore-indels
overlap_filtering_setting=--max-diff 100 --max-cov 100 --min-cov 2
fc_ovlp_to_graph_option=
length_cutoff_pr=1000
The option overlap_filtering_setting allows setting criteria for filtering pread overlaps. --max-diff filters overlaps that have a coverage difference between the 5' and 3' ends larger than specified. --max-cov filters highly represented overlaps, typically caused by contaminants or repeats, and --min-cov allows specification of a minimum overlap coverage.
Setting --min-cov too low will allow more overlaps to be detected, at the cost of possible additional chimeras and misassemblies.
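The three coverage criteria can be pictured with a small sketch. It assumes each pread has one overlap count at its 5' end and one at its 3' end, and that the --max-cov and --min-cov thresholds apply to each end separately; both points are assumptions for illustration, and the names `EndCoverage` and `keep_read` are hypothetical rather than taken from FALCON.

```python
from typing import NamedTuple


class EndCoverage(NamedTuple):
    five_prime: int   # number of overlaps at the 5' end of the pread
    three_prime: int  # number of overlaps at the 3' end of the pread


def keep_read(cov: EndCoverage, max_diff: int, max_cov: int, min_cov: int) -> bool:
    """Apply the --max-diff, --max-cov and --min-cov criteria to one pread."""
    if abs(cov.five_prime - cov.three_prime) > max_diff:
        return False  # unbalanced ends
    if max(cov) > max_cov:
        return False  # over-represented, e.g. repeat or contaminant
    if min(cov) < min_cov:
        return False  # too little support
    return True


# Made-up coverage values, filtered with --max-diff 100 --max-cov 100 --min-cov 2
reads = {
    "balanced": EndCoverage(40, 45),
    "unbalanced": EndCoverage(5, 120),
    "thin": EndCoverage(1, 30),
}
for name, cov in reads.items():
    print(name, keep_read(cov, max_diff=100, max_cov=100, min_cov=2))
```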
length_cutoff_pr is the minimum length of pre-assembled reads (preads) used for the final assembly. Typically, this value is set to allow for approximately 15 to 30-fold coverage of corrected reads in the final assembly.
Miscellaneous configuration options
Additional configuration options that don't necessarily fit into one of the previous categories are described here.
target=assembly
skip_checks=False
LA4Falcon_preload=false
FALCON can be configured to stop after any of its three stages with the target flag set to either overlapping, pre-assembly or assembly. Each option will stop the pipeline at the end of its corresponding stage, 0-rawreads, 1-preads_ovl or 2-asm-falcon respectively. The default is the full assembly pipeline.
The flag skip_checks disables .las file checks with LAcheck, which has been known to cause errors on certain systems in the past.
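The option LA4Falcon_preload passes the -P parameter to LA4Falcon, which loads all reads into memory. On slower file systems this can give a significant speedup, but it greatly increases the memory requirements of the consensus stage.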
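Putting the two LA4Falcon-related switches side by side, the following sketch shows which flags this guide says each setting produces. The function `la4falcon_flags` is hypothetical, and the order in which the flags are combined is an assumption; only -fo, -fog and -P come from the descriptions in this guide.

```python
from typing import List


def la4falcon_flags(greedy: bool = False, preload: bool = False) -> List[str]:
    """Flags implied by falcon_sense_greedy and LA4Falcon_preload."""
    flags = []
    if preload:
        flags.append("-P")  # load all reads into memory
    flags.append("-fog" if greedy else "-fo")
    return flags


print(la4falcon_flags())                            # settings shown in this guide
print(la4falcon_flags(greedy=True, preload=True))   # both options switched on
```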