pb-assembly=0.0.6 parameter settings

Input

[General]
input_fofn=input.fofn
input_type=raw
pa_DBdust_option=true
pa_fasta_filter_option=streamed-median

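input_type: either raw or preads. If preads is specified, the pipeline skips the entire 0-rawreads pre-assembly stage.

pa_fasta_filter_option: defaults to streamed-internal-median. It decides which subread to keep when a single ZMW produced several subreads. "pass": no filtering, keep everything; "streamed-median": keep the median-length subread; "streamed-internal-median": when a ZMW has fewer than 3 subreads keep the longest one, otherwise keep the median-length subread.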
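
The three filter modes described above can be sketched in a few lines of Python. This is only an illustration of the selection rule as stated in this post, not FALCON's own code: the function name `select_subread` is hypothetical, and picking the upper median for an even number of subreads is an assumption.

```python
from typing import List


def select_subread(lengths: List[int], mode: str = "streamed-internal-median") -> List[int]:
    """Return the subread lengths kept for ONE ZMW under the given filter mode."""
    if not lengths:
        return []
    if mode == "pass":
        return list(lengths)  # no filtering: keep every subread
    ordered = sorted(lengths)
    median = ordered[len(ordered) // 2]  # upper median for even counts (assumption)
    if mode == "streamed-median":
        return [median]
    if mode == "streamed-internal-median":
        # fewer than 3 subreads: keep the longest; otherwise keep the median-length one
        return [ordered[-1]] if len(ordered) < 3 else [median]
    raise ValueError(f"unknown filter mode: {mode}")


if __name__ == "__main__":
    zmw = [500, 9000, 3000, 4000, 100]
    for mode in ("pass", "streamed-median", "streamed-internal-median"):
        print(mode, select_subread(zmw, mode))
```

With only two subreads, the default mode keeps the longer one, which avoids discarding a long read just because its partner is a short fragment.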

Data Partitioning

# large genomes
pa_DBsplit_option=-x500 -s200
ovlp_DBsplit_option=-x500 -s200

# small genomes (<10Mb)
pa_DBsplit_option = -x500 -s50
ovlp_DBsplit_option = -x500 -s50

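These settings are passed to DBsplit, which splits the data into multiple blocks; all downstream computation operates on these blocks. -s controls the size of the DB blocks.

If you set pa_fasta_filter_option=pass above, add the -a option to pa_DBsplit_option here.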
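
To get a feel for how -s affects the number of blocks (and therefore the number of jobs), here is a rough back-of-the-envelope estimate. It assumes that -x is a minimum read length and that -s is a target block size in Mbp, as in the DBsplit usage text; the real tool fills blocks read by read, so the actual count can differ slightly. The function name `estimate_blocks` is made up for this example.

```python
import math
from typing import Iterable


def estimate_blocks(read_lengths: Iterable[int], min_len: int = 500, block_mbp: int = 200) -> int:
    """Estimate how many DB blocks a dataset splits into.

    min_len   ~ the -x value (reads shorter than this are ignored)
    block_mbp ~ the -s value (target block size in megabases)
    """
    kept_bases = sum(n for n in read_lengths if n >= min_len)
    return max(1, math.ceil(kept_bases / (block_mbp * 1_000_000)))


if __name__ == "__main__":
    reads = [10_000] * 30_000  # 300 Mbp of toy data
    print(estimate_blocks(reads, block_mbp=200))  # large-genome setting
    print(estimate_blocks(reads, block_mbp=50))   # small-genome setting
```

Smaller blocks mean more, lighter jobs, which is why the small-genome settings above use -s50.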

Repeat Masking

pa_HPCTANmask_option=
pa_REPmask_code=0,300;0,300;0,300

Repeat masking occurs in two phases, Tandem and Interspersed. Tandem repeat masking is run with a modified version of daligner called datander and thus uses a similar parameter set. Whatever settings you use for pre-assembly daligner overlapping in the next section (pa_daligner_option) will be used here for tandem repeat masking. You can supply additional arguments for tandem repeat masking that will be passed to HPC.TANmask with the pa_HPCTANmask_option.

The second phase of masking deals with interspersed repeats and can be run in up to 3 iterations, specified with the pa_REPmask_code option. Each iteration needs a group size and a coverage, given as group,coverage pairs separated by semicolons as seen above.
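
The group,coverage format is easy to get wrong, so a small parser can serve as a sanity check before launching a run. This is a standalone sketch; `parse_repmask_code` is a hypothetical helper and not part of FALCON.

```python
from typing import List, Tuple


def parse_repmask_code(code: str) -> List[Tuple[int, int]]:
    """Split a pa_REPmask_code value into (group, coverage) pairs, one per iteration."""
    rounds = []
    for chunk in code.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon
        group, coverage = (int(x) for x in chunk.split(","))
        rounds.append((group, coverage))
    if len(rounds) > 3:
        raise ValueError("at most 3 repeat-masking iterations are supported")
    return rounds


if __name__ == "__main__":
    for i, (group, cov) in enumerate(parse_repmask_code("0,300;0,300;0,300"), start=1):
        print(f"iteration {i}: group={group} coverage={cov}")
```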

For information and theory on how to set up your rounds of repeat masking, consult this blog post.

Pre-assembly

genome_size=1000000000
seed_coverage=30
length_cutoff=-1
pa_HPCdaligner_option=-v -B128 -M24
pa_daligner_option=-e0.8 -l2000 -k18 -h480  -w8 -s100
falcon_sense_option=--output-multi --min-idt 0.70 --min-cov 3 --max-n-read 400
falcon_sense_greedy=False

During pre-assembly, the PacBio subreads are aligned and error correction is performed. The longest subreads are chosen as seed reads and all shorter reads are aligned to them and consensus sequences are generated from the alignments. These consensus sequences are called pre-assembled reads or preads and generally have accuracy greater than 99% or QV20.

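To have the seed-read length cutoff computed automatically, set length_cutoff=-1 and supply genome_size and seed_coverage; the cutoff is then derived from those two values. We generally recommend 20-40x seed coverage.
Alternatively, if you do not know the genome size, are unsure what seed_coverage to use, or simply want to use all reads above a specific length, you can set that limit manually with length_cutoff.

Note that whatever value length_cutoff takes, it also acts as a limit for falcon-unzip: reads shorter than the cutoff are not used for phasing. For assembly alone, a high length_cutoff probably does little harm unless you expect a specific feature such as micro-chromosomes or short circular plasmids. If you plan to unzip, however, a high cutoff artificially limits your phasing dataset, so a lower length_cutoff may serve you better. Most of the computation happens in pre-assembly, so if compute time matters, raising length_cutoff improves efficiency, subject to the trade-off above.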
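
The idea behind the automatic cutoff can be sketched as follows: walk through the reads from longest to shortest until the accumulated bases reach genome_size x seed_coverage, and take the length of the last read added. This is a simplified illustration only; FALCON computes its cutoff from its own read-length statistics, and the function name `auto_length_cutoff` is hypothetical.

```python
from typing import Iterable


def auto_length_cutoff(read_lengths: Iterable[int], genome_size: int, seed_coverage: float) -> int:
    """Smallest seed length such that reads at least this long give the requested coverage."""
    target_bases = genome_size * seed_coverage
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    raise ValueError("not enough bases to reach the requested seed coverage")


if __name__ == "__main__":
    reads = [10_000] * 10 + [5_000] * 20 + [1_000] * 100
    # toy 10 kb "genome": compare a low and a higher seed coverage
    print(auto_length_cutoff(reads, genome_size=10_000, seed_coverage=5))
    print(auto_length_cutoff(reads, genome_size=10_000, seed_coverage=15))
```

Asking for more seed coverage pushes the cutoff down, because shorter reads have to be admitted as seeds to reach the target.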

Overlap options for daligner are set with the pa_HPCdaligner_option and pa_daligner_option flags. Previous versions of FALCON had a single parameter. This is now split into two flags, one that affects requested resources (pa_HPCdaligner_option) and one that affects the overlap search (pa_daligner_option). For pa_HPCdaligner_option, the -v parameter is passed to the LAsort and LAmerge programs while the -B and -M parameters are passed to the daligner sub-commands.

To understand the theory and how to configure daligner see this blog post and this command reference guide.

For daligner, in general we recommend the following:

-e: average correlation rate (average sequence identity)

0.70 (low quality data) - 0.80 (high quality data). A higher value will help prevent haplotype collapse.

-l: minimum length of overlap

1000 (shorter library) - 5000 (longer library)

-k: kmer size

14 (low quality data) - 18 (high quality data)

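Lower -k values give higher sensitivity at the cost of more disk space, more memory and longer run times, and work best with lower quality data. Conversely, larger k-mer values give higher specificity, use fewer system resources and run faster, but are only suitable for high quality data.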
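
As a quick check of a pa_daligner_option string against the ranges recommended above, something like the following can help. Note the assumptions: the ranges in `RECOMMENDED` are copied from this post and are not enforced by daligner itself, and both helper names are made up for this example.

```python
import re
from typing import Dict, List, Tuple

# Ranges recommended in this post for raw-read overlapping (not limits imposed by daligner).
RECOMMENDED: Dict[str, Tuple[float, float]] = {
    "e": (0.70, 0.80),
    "l": (1000, 5000),
    "k": (14, 18),
}


def parse_daligner_options(option_string: str) -> Dict[str, float]:
    """Turn '-e0.8 -l2000 -k18' into {'e': 0.8, 'l': 2000.0, 'k': 18.0}; flags without a number are skipped."""
    opts = {}
    for token in option_string.split():
        match = re.fullmatch(r"-([A-Za-z])(\d*\.?\d+)", token)
        if match:
            opts[match.group(1)] = float(match.group(2))
    return opts


def out_of_range(opts: Dict[str, float]) -> List[str]:
    """List the flags whose values fall outside the recommended range."""
    return [flag for flag, (lo, hi) in RECOMMENDED.items()
            if flag in opts and not lo <= opts[flag] <= hi]


if __name__ == "__main__":
    opts = parse_daligner_options("-e0.8 -l2000 -k18 -h480  -w8 -s100")
    print(opts)
    print("outside recommended range:", out_of_range(opts))
```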

You can configure basic pre-assembly consensus calling options with the falcon_sense_option flag.
--output-multi: necessary for generating proper fasta headers
--min-idt: minimum alignment identity
--min-cov: minimum coverage necessary
--max-n-read: maximum number of reads used for calling consensus to make the preads

By default, -fo are the parameters passed to LA4Falcon. The option falcon_sense_greedy changes this parameter set to -fog which essentially attempts to maintain relative information between reads that have been broken due to regions of low quality.

Pread overlapping

ovlp_HPCdaligner_option=-v -M24 -l500
ovlp_daligner_option=-e.96 -s1000 -h60

The second phase of error-corrected read overlapping occurs in a similar fashion to the overlapping performed in the pre-assembly, however no repeat masking is performed and no consensus is called. Overlaps are identified and fed into the final assembly. The parameter options work the same way as described above in the pre-assembly section.

Recommendation for preads:

-e: average correlation rate (average sequence identity)

0.93 (inbred) - 0.96 (outbred)

-l: minimum length of overlap

1800 (poor preassembly, short/low quality library) - 6000 (long, high quality library)

-k: kmer size

18 (low quality) - 24 (most cases)

Final Assembly

# experiment with "--min-idt" to collapse (98-99) or split haplotypes (up to 99.9) during contig assembly
# if you plan to unzip, collapse first using ~98, lower for very divergent haplotypes
# ignore indels looks at only substitutions in overlaps, allows higher overlap stringency to reduce repeat-induced errors
overlap_filtering_setting = --max-diff 400 --max-cov 400 --min-cov 2 --n-core 24 --min-idt 99.9 --ignore-indels

overlap_filtering_setting=--max-diff 100 --max-cov 100 --min-cov 2
fc_ovlp_to_graph_option=
length_cutoff_pr=1000

The option overlap_filtering_setting sets the criteria for filtering pread overlaps. --max-diff filters overlaps that have a coverage difference between the 5' and 3' ends larger than specified. --max-cov filters highly represented overlaps, typically caused by contaminants or repeats, and --min-cov specifies a minimum overlap coverage.

Setting --min-cov too low allows more overlaps to be detected, at the cost of possible additional chimeric contigs or mis-assemblies.
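
The three thresholds can be illustrated with a toy filter that works on per-read overlap counts at each end. This is a simplified sketch of the rules as described here, not the actual FALCON filter: the input format (a dict of 5' and 3' overlap counts per read) and the function name `reads_to_exclude` are assumptions for illustration.

```python
from typing import Dict, Set, Tuple


def reads_to_exclude(end_counts: Dict[str, Tuple[int, int]],
                     max_diff: int, max_cov: int, min_cov: int) -> Set[str]:
    """Return the IDs of preads whose overlaps would be filtered out.

    end_counts maps a read ID to (overlaps at the 5' end, overlaps at the 3' end).
    """
    excluded = set()
    for read_id, (cov5, cov3) in end_counts.items():
        if abs(cov5 - cov3) > max_diff:
            excluded.add(read_id)  # unbalanced ends
        elif cov5 > max_cov or cov3 > max_cov:
            excluded.add(read_id)  # over-represented: likely repeat or contaminant
        elif cov5 < min_cov or cov3 < min_cov:
            excluded.add(read_id)  # too little support at one end
    return excluded


if __name__ == "__main__":
    counts = {"a": (30, 35), "b": (150, 20), "c": (1, 40), "d": (120, 110)}
    print(sorted(reads_to_exclude(counts, max_diff=100, max_cov=100, min_cov=2)))
```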

length_cutoff_pr is the minimum length of pre-assembled preads used for the final assembly. Typically, this value is set to allow for approximately 15 to 30-fold coverage of corrected reads in the final assembly.
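
To see which length_cutoff_pr values land in the 15 to 30-fold window, you can compute the pread coverage that remains at each candidate cutoff. A minimal sketch, assuming you already have the pread lengths and a genome size estimate; both function names are hypothetical.

```python
from typing import Iterable, List


def pread_coverage(pread_lengths: Iterable[int], genome_size: int, length_cutoff_pr: int) -> float:
    """Fold coverage contributed by preads at least length_cutoff_pr long."""
    return sum(n for n in pread_lengths if n >= length_cutoff_pr) / genome_size


def cutoffs_in_window(pread_lengths: List[int], genome_size: int, candidates: Iterable[int],
                      lo: float = 15.0, hi: float = 30.0) -> List[int]:
    """Candidate cutoffs that leave between lo and hi fold coverage."""
    return [c for c in candidates
            if lo <= pread_coverage(pread_lengths, genome_size, c) <= hi]


if __name__ == "__main__":
    preads = [12_000] * 2_000 + [3_000] * 4_000 + [800] * 5_000
    for cutoff in (500, 1_000, 5_000):
        print(cutoff, pread_coverage(preads, 1_000_000, cutoff))
    print(cutoffs_in_window(preads, 1_000_000, [500, 1_000, 5_000]))
```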

Miscellaneous configuration options

Additional configuration options that don't necessarily fit into one of the previous categories are described here.

target=assembly
skip_checks=False
LA4Falcon_preload=false

FALCON can be configured to stop after any of its three stages with the target flag set to either pre-assembly, overlapping or assembly. Each option will stop the pipeline at the end of its corresponding stage, 0-rawreads, 1-preads_ovl or 2-asm-falcon respectively. The default is the full assembly pipeline.

The flag skip_checks disables .las file checks with LAcheck which has been known to cause errors on certain systems in the past.

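The option LA4Falcon_preload passes the -P parameter to LA4Falcon, which loads all reads into memory. On slower file systems this can give a significant speed-up, but it greatly increases the memory requirements of the consensus stage.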
