Canu parameter tuning
For all stages:

- `rawErrorRate` is the maximum expected difference in an alignment of two uncorrected reads. It is a meta-parameter that sets other parameters, and it generally does not need to be adjusted.
- `correctedErrorRate` is the maximum expected difference in an alignment of two corrected reads. It is a meta-parameter that sets other parameters. (If you're used to the `errorRate` parameter, multiply that by 3 and use it here.) This parameter often needs to be tuned over multiple assemblies: raising it increases run time, while lowering it decreases run time at the risk of losing overlaps and breaking the assembly. The default is 0.045 for PacBio and 0.144 for Nanopore. For low-coverage datasets (less than 30X), we recommend raising it by about 0.01; for high-coverage datasets (more than 60X), we recommend lowering it by about 0.01.
- `minReadLength` and `minOverlapLength`. The defaults are to discard reads shorter than 1000bp and to not look for overlaps shorter than 500bp. Increasing `minReadLength` can improve run time, and increasing `minOverlapLength` can improve assembly quality by removing false overlaps. However, increasing either too much will quickly degrade assemblies by either omitting valuable reads or missing true overlaps.
- `minReadLength`, the minimum read length, defaults to 1000 and must be larger than `minOverlapLength`. If it is set high enough, the gatekeeper module will claim there are errors in the input, because too many input reads have been discarded. As long as there is sufficient coverage, this is not a problem.
- `minOverlapLength`, the minimum overlap length, defaults to 500 and must be smaller than `minReadLength`. Smaller values can be used to overcome a lack of read coverage, but will also lead to false overlaps and potential misassemblies. Larger values will result in more correct assemblies, but more fragmented ones.
- `genomeSize`, an estimate of the genome size, e.g. 3.7m or 2.8g. The genome size estimate is used to decide how many reads to correct (via the `corOutCoverage` parameter) and how sensitive the mhap overlapper should be (via the `mhapSensitivity` parameter). It also affects some logging, in particular the reporting of N50 sizes.
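As a sketch, these stage-wide parameters are passed directly on the canu command line. The project name, output directory, and read file below are hypothetical placeholders:

```shell
# Hypothetical example: assemble a ~3.7 Mbp genome from raw PacBio reads,
# raising correctedErrorRate by ~0.01 for a low-coverage (<30X) dataset.
canu -p asm -d asm-run \
  genomeSize=3.7m \
  correctedErrorRate=0.055 \
  minReadLength=1000 minOverlapLength=500 \
  -pacbio-raw reads.fastq.gz
```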
For correction:
- `corOutCoverage` controls how much coverage in corrected reads is generated. The default is to target 40X but, for various reasons, this results in 30X to 35X of reads being generated.
- `corMinCoverage`, loosely, controls the quality of the corrected reads. It is the coverage in evidence reads that is needed before a (portion of a) corrected read is reported. Corrected reads are generated as a consensus of other reads; this is just the minimum coverage needed for the consensus sequence to be reported. The default is based on input read coverage: 0X for less than 30X input coverage, and 4X for more than that.
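For instance, the correction stage can be run on its own and tuned independently. This is a hypothetical sketch; the genome size and read file are placeholders:

```shell
# Hypothetical example: run only read correction (-correct),
# keeping more corrected reads than the default 40X target.
canu -correct -p asm -d asm-corr \
  genomeSize=50m \
  corOutCoverage=100 corMinCoverage=4 \
  -pacbio-raw reads.fastq.gz
```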
For assembly:
- `utgOvlErrorRate` is essentially a speed optimization. Overlaps above this error rate are not computed. Setting it too high generally just wastes compute time, while setting it too low will degrade assemblies by missing true overlaps between lower quality reads. It generally does not need adjustment.
- `utgGraphDeviation` and `utgRepeatDeviation` set what quality of overlaps are used in contig construction and in breaking contigs at false repeat joins, respectively. Both are expressed as a deviation from the mean error rate in the longest overlaps. They normally do not need adjustment.
- `utgRepeatConfusedBP` controls how similar a true overlap (between two reads in the same contig) and a false overlap (between two reads in different contigs) need to be before the contig is split. When this occurs, it isn't clear which overlap is 'true' - the longer one or the slightly shorter one - and the contig is split to avoid misassemblies. It normally does not need adjustment.
For polyploid genomes:
Generally, there are a couple of ways of dealing with the ploidy.
1. Avoid collapsing the genome, so you end up with double (assuming diploid) the genome size, as long as your divergence is above about 2% (for PacBio data). Below this divergence, you'd end up collapsing the variations. We've used the following parameters for polyploid populations (PacBio data):

   ```
   corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"
   ```
   This will output more corrected reads (than the default 40X). The latter option will be more conservative at picking the error rate used for the assembly, to try to maintain haplotype separation. If it works, you'll end up with an assembly >= 2x your haploid genome size. Post-processing using gene information or other synteny information is required to remove redundancy from this assembly.

2. Smash haplotypes together (not generally recommended) and then do phasing using another approach (like HapCUT2 or whatshap or others). In that case you want to do the opposite: increase the error rates used for finding overlaps:
   ```
   corOutCoverage=200 correctedErrorRate=0.15
   ```
When trimming, reads will be trimmed using other reads in the same chromosome (and probably some reads from other chromosomes). When assembling, overlaps well outside the observed error rate distribution are discarded.
We strongly recommend option 1, which will lead to a larger than expected genome size. See "My genome size and assembly size are different, help!" for details on how to remove this duplication. We have (in limited testing) successfully used purge_dups to remove the redundancy.
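Putting option 1 together, a hypothetical invocation (the genome size and read file are placeholders) might look like:

```shell
# Hypothetical example: keep haplotypes separate for a diploid genome.
# Expect the resulting assembly to be roughly twice the haploid genome size;
# redundancy is then removed in post-processing (e.g. with purge_dups).
canu -p asm -d asm-polyploid \
  genomeSize=500m \
  corOutCoverage=200 \
  "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
  -pacbio-raw reads.fastq.gz
```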
For metagenomes:
The basic idea is to use all of the data for assembly, rather than only the longest reads as is the default. The parameters we've used recently are:
```
corOutCoverage=10000 corMhapSensitivity=high corMinCoverage=0 redMemory=32 oeaMemory=32 batMemory=200
```
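As a full-command sketch (the genome size here is only a rough working guess for a metagenome, and the read file is a placeholder):

```shell
# Hypothetical example: metagenome assembly using (nearly) all reads.
# corOutCoverage=10000 effectively disables the corrected-read coverage cutoff.
canu -p meta -d meta-run \
  genomeSize=100m \
  corOutCoverage=10000 corMhapSensitivity=high corMinCoverage=0 \
  redMemory=32 oeaMemory=32 batMemory=200 \
  -nanopore-raw reads.fastq.gz
```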
For low coverage:
For less than 30X coverage, increase the allowed difference in overlaps by a few percent (from 4.5% to 8.5% or more with correctedErrorRate=0.105 for PacBio, and from 14.4% to 16% or more with correctedErrorRate=0.16 for Nanopore) to adjust for inferior read correction. Canu will automatically reduce corMinCoverage to zero to correct as many reads as possible.
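A minimal sketch for the low-coverage case (genome size and read file are hypothetical):

```shell
# Hypothetical example: <30X raw PacBio coverage, relax the overlap error rate.
canu -p asm -d asm-lowcov \
  genomeSize=10m \
  correctedErrorRate=0.105 \
  -pacbio-raw reads.fastq.gz
```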
For high coverage:
For more than 60X coverage, decrease the allowed difference in overlaps (from 4.5% to 4.0% with correctedErrorRate=0.040 for PacBio, from 14.4% to 12% with correctedErrorRate=0.12 for Nanopore), so that only the better corrected reads are used. This is primarily an optimization for speed and generally does not change assembly continuity.
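And a minimal sketch for the high-coverage case, here with Nanopore reads (genome size and read file are hypothetical):

```shell
# Hypothetical example: >60X raw Nanopore coverage, tighten the overlap
# error rate so only the better corrected reads are used.
canu -p asm -d asm-highcov \
  genomeSize=10m \
  correctedErrorRate=0.12 \
  -nanopore-raw reads.fastq.gz
```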