之前我写过一篇利用3D-DNA流程基于Hi-C提升基因组组装,那个时候我做的项目并不多,也没有遇到坑,直到最近我用3D-DNA流程进行基因组组装,发现结果出乎意料
在上图中,蓝色框表示组装出来的scaffold,其余则是3D-DNA认为的debris,也就是边角料。谁能想到一个基因组有那么多的边角料呢?
经 @上海欧易生物-鲍志贵的提醒,我检查了genome.0.hic, 也就是初步组装的结果,发现结果好的惊人。
谁能想到,越纠错结果越错呢?现在的问题是,既然错误已经存在了,那么应该如何处理呢?
最容易的解决方法就是设置 -r 0
, 也就是不纠错,直接用genome.0.hic的结果开始拆分染色体,后续通juicebox(https://github.com/aidenlab/Juicebox/wiki)手动处理潜在组装错误。这其实并没有解决问题,而是躲避了问题。于是我去阅读了3D-DNA 发表在science上文章的附录,和run-asm-pipeline.sh的源代码,功夫不负有心人,终于被我定位到了问题的关键所在。
run-asm-pipeline.sh
的651-750是3D-DNA流程的核心算法: misjoin-correction. 其中下面的代码是用来找到潜在的mis-assembly, 用于后续纠错
bash ${pipeline}/edit/run-mismatch-detector.sh -p ${parallel} -c ${editor_saturation_centile} -w ${editor_coarse_resolution} -d ${editor_coarse_region} -k ${editor_coarse_stringency} -n ${editor_fine_resolution} ${genomeid}.${ROUND}.hic
bash ${pipeline}/edit/run-coverage-analyzer.sh -w ${editor_coarse_resolution} -t ${editor_repeat_coverage} ${genomeid}.${ROUND}.hic
输出的 mismatch_narrow.at.step.${ROUND}.bed
和 repeats_wide.at.step.${ROUND}.bed
后续会被3D-DNA用于编辑原始的contig/scaffold。我分别统计了这两者的数量和长度, 其中mismatch长度为1,也就是没有发现组装错误,而重复区域则非常可怕
awk '{print $3-$2}' repeats_wide.at.step.0.bed | sort -k1,1nr | head
13775000
11450000
9200000
7075000
5325000
4350000
4225000
4225000
4175000
4100000
最长的重复区域居然长达13 Mb, 想想都不太可能啊(我统计了另一组数据,长度最长为50K)。显然这就是问题的所在,接下来就该解决问题了。
既然是重复区域出了问题,我翻阅了参数列表,找到了一个与其相关的参数,即--editor-repeat-coverage
, 默认是2,我们可以运行run-coverage-analyzer.sh
, 测试不同参数下(-t
)下的重复序列长度情况,从而确定一个比较合适的大小。
bash /opt/biosoft/3d-dna/edit/run-coverage-analyzer.sh -w 25000 -t 3 genome.0.hic
此外,我还查阅了在Google Group上相关的讨论,发现这个问题是由于文库导致(Most likely it is due to coverage biases in your experiment. Load the tracks associated with repeats to help highlight. You can increase --editor-repeat-coverage to help mitigate.)
Highly likely repeats is main problem of this assembly. Significant part of Hi-C reads are filtered as below MAPQ but checking reads shows non-unique mapping at least one side of such reads, and it can be result of presence large fraction of repeats. As well as remapping reads to selected after JBAT chromosome-size scaffolds shows coverage anomalies looks like crosses with much more higher contact frequency than other genome, probably corresponding to condensed repeats. And can i ask your advise how should we deal with such repeats mark with repeatFinder and hardmask before reads mapping or something else? And we try to increase editor-repeat-coverage parameter.
参考资料: