Parameter settings for genome assembly with Canu 1.8

Canu parameter tuning

For all stages:

  • rawErrorRate is the maximum expected difference in an alignment of two uncorrected reads. It is a meta-parameter that sets other parameters.
    Note: this usually does not need to be adjusted.
  • correctedErrorRate is the maximum expected difference in an alignment of two corrected reads. It is a meta-parameter that sets other parameters. (If you’re used to the errorRate parameter, multiply that by 3 and use it here.)
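    This is the allowed difference in an overlap between two corrected reads, expressed as a fraction error. It often needs to be adjusted several times over the course of an assembly. Raising it increases run time; lowering it reduces run time but risks missing overlaps and fragmenting the assembly. The default is 0.045 for PacBio and 0.144 for Nanopore.
    For low-coverage datasets (less than 30X), we suggest raising correctedErrorRate by about 0.01;
    for high-coverage datasets (more than 60X), we suggest lowering it by about 0.01.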
  • minReadLength and minOverlapLength. The defaults are to discard reads shorter than 1000bp and to not look for overlaps shorter than 500bp. Increasing minReadLength can improve run time, and increasing minOverlapLength can improve assembly quality by removing false overlaps. However, increasing either too much will quickly degrade assemblies by either omitting valuable reads or missing true overlaps.
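    In short: raising minReadLength speeds up the run, and raising minOverlapLength reduces false-positive overlaps.
  • minReadLength is the minimum read length (default 1000) and must be larger than minOverlapLength. If it is set high enough, the gatekeeper module will report errors in the input because too many input reads have been discarded. As long as enough coverage remains, this is not a problem.
  • minOverlapLength is the minimum overlap length (default 500) and must be smaller than minReadLength. Smaller values can compensate for insufficient read coverage, but they also lead to false overlaps and potential misassemblies. Larger values give more correct but more fragmented assemblies.
  • genomeSize is an estimate of the genome size, for example 3.7m or 2.8g. The estimate is used to decide how many reads to correct (via the corOutCoverage parameter) and how sensitive the mhap overlapper should be (via the mhapSensitivity parameter). It also affects some logging, in particular the reported N50 sizes.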
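
The constraints above can be sketched in a few lines of Python. This is only an illustration: `parse_genome_size` and `length_options` are hypothetical helpers, not part of Canu, and the accepted suffixes (k, m, g) are an assumption based on the 3.7m and 2.8g examples.

```python
def parse_genome_size(text):
    """Convert a size string such as '3.7m' or '2.8g' to a number of bases.

    Assumption: k, m and g mean thousand, million and billion bases.
    """
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(round(float(text)))


def length_options(min_read_length=1000, min_overlap_length=500):
    """Return the two length options after checking their ordering.

    The defaults (1000 and 500) are the ones quoted in the text.
    """
    if min_read_length <= min_overlap_length:
        raise ValueError("minReadLength must be larger than minOverlapLength")
    return [
        f"minReadLength={min_read_length}",
        f"minOverlapLength={min_overlap_length}",
    ]


if __name__ == "__main__":
    print(parse_genome_size("3.7m"))
    print(length_options(2000, 1000))
```

Checking the ordering before launching a long run is cheap, which is the only reason for wrapping the two options in a function here.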

For correction:

  • corOutCoverage controls how much coverage in corrected reads is generated. The default is to target 40X, but, for various reasons, this results in 30X to 35X of reads being generated.


  • corMinCoverage, loosely, controls the quality of the corrected reads. It is the coverage in evidence reads that is needed before a (portion of a) corrected read is reported. Corrected reads are generated as a consensus of other reads; this is just the minimum coverage needed for the consensus sequence to be reported. The default is based on input read coverage: 0x coverage for less than 30X input coverage, and 4x coverage for more than that.
    Controls the quality of the corrected reads (default 0 or 4, depending on input coverage).

For assembly:

  • utgOvlErrorRate is essentially a speed optimization. Overlaps above this error rate are not computed. Setting it too high generally just wastes compute time, while setting it too low will degrade assemblies by missing true overlaps between lower quality reads.
    A speed optimization; it usually does not need to be adjusted.
  • utgGraphDeviation and utgRepeatDeviation control what quality of overlaps is used in contig construction or in breaking contigs at false repeat joins, respectively. Both are in terms of a deviation from the mean error rate in the longest overlaps.
    Leave unchanged.
  • utgRepeatConfusedBP controls how similar a true overlap (between two reads in the same contig) and a false overlap (between two reads in different contigs) need to be before the contig is split. When this occurs, it isn’t clear which overlap is ‘true’ - the longer one or the slightly shorter one - and the contig is split to avoid misassemblies.
    Leave unchanged.

For polyploid genomes:

Generally, there are a couple of ways of dealing with the ploidy.

  1. Avoid collapsing the genome so you end up with double (assuming diploid) the genome size as long as your divergence is above about 2% (for PacBio data). Below this divergence, you’d end up collapsing the variations. We’ve used the following parameters for polyploid populations (PacBio data):
    corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"
    This will output more corrected reads (than the default 40x). The latter option will be more conservative at picking the error rate to use for the assembly to try to maintain haplotype separation. If it works, you’ll end up with an assembly >= 2x your haploid genome size. Post-processing using gene information or other synteny information is required to remove redundancy from this assembly.
    A tool such as purge_dups can be used to remove this redundancy.

  2. Smash haplotypes together and then do phasing using another approach (like HapCUT2 or whatshap or others). In that case you want to do the opposite, increase the error rates used for finding overlaps:
    Note: smashing the haplotypes together is not recommended.
    corOutCoverage=200 correctedErrorRate=0.15
    When trimming, reads will be trimmed using other reads in the same chromosome (and probably some reads from other chromosomes). When assembling, overlaps well outside the observed error rate distribution are discarded.

We strongly recommend option 1 which will lead to a larger than expected genome size. See “My genome size and assembly size are different, help!” for details on how to remove this duplication.
In limited testing, we have successfully used purge_dups to remove the redundancy.
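
As an illustration of how the two option sets might be passed on the command line, here is a minimal Python sketch that only builds and prints a command; it does not run Canu. The `-p`, `-d` and `-pacbio-raw` arguments follow the usual Canu 1.x invocation and are an assumption rather than something stated in this post. The prefix, output directory, genome size and read file are placeholders, and `build_canu_command` is a hypothetical helper.

```python
import shlex

# Option sets quoted in the text above.
SEPARATE_HAPLOTYPES = [
    "corOutCoverage=200",
    "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50",
]
COLLAPSE_HAPLOTYPES = [
    "corOutCoverage=200",
    "correctedErrorRate=0.15",
]


def build_canu_command(prefix, out_dir, genome_size, reads, extra_options):
    """Assemble an argument list for Canu (assumed 1.x style invocation)."""
    cmd = ["canu", "-p", prefix, "-d", out_dir, f"genomeSize={genome_size}"]
    cmd.extend(extra_options)
    cmd.extend(["-pacbio-raw", reads])
    return cmd


if __name__ == "__main__":
    # Placeholder names; replace with real values before use.
    cmd = build_canu_command(
        "asm", "asm-out", "500m", "reads.fastq.gz", SEPARATE_HAPLOTYPES
    )
    # shlex.join quotes the batOptions value so it stays one shell argument.
    print(shlex.join(cmd))
```

The point of going through `shlex.join` is that the batOptions value contains spaces and has to reach Canu as a single argument, which is why the post writes it inside double quotes.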

For metagenomes:

The basic idea is to use all data for assembly rather than just the longest as default. The parameters we’ve used recently are:

corOutCoverage=10000 corMhapSensitivity=high corMinCoverage=0 redMemory=32 oeaMemory=32 batMemory=200

For low coverage:

For less than 30X coverage, increase the allowed difference in overlaps by a few percent (from 4.5% to 8.5% (or more) with correctedErrorRate=0.105 for PacBio and from 14.4% to 16% (or more) with correctedErrorRate=0.16 for Nanopore), to adjust for inferior read correction. Canu will automatically reduce corMinCoverage to zero to correct as many reads as possible.

For high coverage:

For more than 60X coverage, decrease the allowed difference in overlaps (from 4.5% to 4.0% with correctedErrorRate=0.040 for PacBio, from 14.4% to 12% with correctedErrorRate=0.12 for Nanopore), so that only the better corrected reads are used. This is primarily an optimization for speed and generally does not change assembly continuity.
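
The low- and high-coverage advice can be condensed into a small lookup. The sketch below is a hypothetical helper, not part of Canu: it uses the example values quoted in these two sections (0.105 / 0.16 below 30X, 0.040 / 0.12 above 60X) and the defaults (0.045 / 0.144) in between. Estimating coverage as total read bases divided by genome size is an assumption of the sketch.

```python
DEFAULT_RATE = {"pacbio": 0.045, "nanopore": 0.144}
LOW_COVERAGE_RATE = {"pacbio": 0.105, "nanopore": 0.16}
HIGH_COVERAGE_RATE = {"pacbio": 0.040, "nanopore": 0.12}


def estimate_coverage(total_read_bases, genome_size_bases):
    """Rough coverage estimate: total bases in reads over genome size."""
    return total_read_bases / genome_size_bases


def suggest_corrected_error_rate(technology, coverage):
    """Pick a correctedErrorRate starting point from the quoted values."""
    tech = technology.lower()
    if tech not in DEFAULT_RATE:
        raise ValueError(f"unknown technology: {technology}")
    if coverage < 30:
        return LOW_COVERAGE_RATE[tech]
    if coverage > 60:
        return HIGH_COVERAGE_RATE[tech]
    return DEFAULT_RATE[tech]


if __name__ == "__main__":
    coverage = estimate_coverage(75_000_000, 3_000_000)  # placeholder numbers
    rate = suggest_corrected_error_rate("pacbio", coverage)
    print(f"correctedErrorRate={rate}")
```

These values are starting points only; the text notes that the low-coverage figures may need to go higher.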
