After installing the EDTA environment with conda, running EDTA.pl prints the help text:
$ EDTA.pl
#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0 #####
##### Shujun Ou (shujun.ou.1@gmail.com) #####
#########################################################
Parameters:
At least 1 parameter is required:
1) Input fasta file: --genome
This is the Extensive de-novo TE Annotator that generates a high-quality
structure-based TE library. Usage:
perl EDTA.pl [options]
--genome [File] The genome FASTA file. Required.
--species [Rice|Maize|others] Specify the species for identification of TIR
candidates. Default: others
--step [all|filter|final|anno] Specify which steps you want to run EDTA.
all: run the entire pipeline (default)
filter: start from raw TEs to the end.
final: start from filtered TEs to finalizing the run.
anno: perform whole-genome annotation/analysis after
TE library construction.
--overwrite [0|1] If previous raw TE results are found, decide to overwrite
(1, rerun) or not (0, default).
--cds [File] Provide a FASTA file containing the coding sequence (no introns,
UTRs, nor TEs) of this genome or its close relative.
--curatedlib [File] Provided a curated library to keep consistant naming and
classification for known TEs. TEs in this file will be
trusted 100%, so please ONLY provide MANUALLY CURATED ones.
This option is not mandatory. It's totally OK if no file is
provided (default).
--rmlib [File] Provide the RepeatModeler library containing classified TEs to enhance
the sensitivity especially for LINEs. If no file is provided (default),
EDTA will generate such file for you.
--sensitive [0|1] Use RepeatModeler to identify remaining TEs (1) or not (0,
default). This step may help to recover some TEs.
--anno [0|1] Perform (1) or not perform (0, default) whole-genome TE annotation
after TE library construction.
--rmout [File] Provide your own homology-based TE annotation instead of using the
EDTA library for masking. File is in RepeatMasker .out format. This
file will be merged with the structural-based TE annotation. (--anno 1
required). Default: use the EDTA library for annotation.
--maxdiv [0-100] Maximum divergence (0-100%, default: 40) of repeat fragments comparing to
library sequences.
--evaluate [0|1] Evaluate (1) classification consistency of the TE annotation.
(--anno 1 required). Default: 1.
--exclude [File] Exclude regions (bed format) from TE masking in the MAKER.masked
output. Default: undef. (--anno 1 required).
--force [0|1] When no confident TE candidates are found: 0, interrupt and exit
(default); 1, use rice TEs to continue.
--u [float] Neutral mutation rate to calculate the age of intact LTR elements.
Intact LTR age is found in this file: *EDTA_raw/LTR/*.pass.list.
Default: 1.3e-8 (per bp per year, from rice).
--repeatmodeler [path] The directory containing RepeatModeler (default: read from ENV)
--repeatmasker [path] The directory containing RepeatMasker (default: read from ENV)
--annosine [path] The directory containing AnnoSINE_v2 (default: read from ENV)
--ltrretriever [path] The directory containing LTR_retriever (default: read from ENV)
--check_dependencies Check if dependencies are fullfiled and quit
--threads|-t [int] Number of theads to run this script (default: 4)
--debug [0|1] Retain intermediate files (default: 0)
--help|-h Display this help info
The actual run then failed during LTR identification:
Mon Aug 5 16:34:45 CST 2024 Obtain raw TE libraries using various structure-based programs:
Mon Aug 5 16:34:46 CST 2024 EDTA_raw: Check dependencies, prepare working directories.
Mon Aug 5 16:34:53 CST 2024 Start to find LTR candidates.
Mon Aug 5 16:34:53 CST 2024 Identify LTR retrotransposon candidates from scratch.
Invalid value for shared scalar at /home/x/miniconda3/envs/EDTA1/share/LTR_retriever/bin/LTR.identifier.pl line 114, <ANNO> line 384.
cp: cannot stat '0712.390m.last.fasta.mod.retriever.scn.adj': No such file or directory
awk: fatal: cannot open file `0712.390m.last.fasta.mod.pass.list' for reading: No such file or directory
Warning: LOC list - is empty.
Error: Error while loading sequence
Filter sequence based on TEsorter classifications. Unclassified sequences will also be output to the clean file.
Usage: perl cleanup_misclas.pl sequence.fa.rexdb.cls.tsv
Author: Shujun Ou (shujun.ou.1@gmail.com) 10/11/2019
mv: cannot stat '0712.390m.last.fasta.mod.LTR.intact.fa.ori.dusted.cln.cln': No such file or directory
mv: cannot stat '0712.390m.last.fasta.mod.LTR.intact.fa.ori.dusted.cln.cln.list': No such file or directory
cp: cannot stat '0712.390m.last.fasta.mod.LTR.intact.raw.fa.anno.list': No such file or directory
ERROR: No such file or directory at /home/x/miniconda3/envs/EDTA1/share/EDTA/util/output_by_list.pl line 39.
perl filter_gff3.pl file.gff3 file.list > new.gff3
Mon Aug 5 16:40:01 CST 2024 Warning: The LTR result file has 0 bp!
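Because EDTA keeps going after the LTR step breaks, the failure is easy to miss in a long log. Below is a minimal sketch of a check for the failure signatures seen in the log above. The function name `edta_log_failed` and the file name `edta.log` are my own placeholders, not part of EDTA, and the patterns are simply copied from this particular run, so other failures may look different.

```shell
#!/bin/sh
# Hypothetical helper (not part of EDTA): succeeds when the log contains
# one of the failure signatures copied from the failed run above.
edta_log_failed() {
    log_file=$1
    grep -E -q 'Invalid value for shared scalar|cannot stat|has 0 bp!' "$log_file"
}

# Usage sketch. edta.log is an assumed file name; the error lines go to
# stderr, so the log would need to capture both streams, e.g.
#   EDTA.pl --genome genome.fa -t 12 > edta.log 2>&1
if [ -f edta.log ]; then
    if edta_log_failed edta.log; then
        echo "LTR step failed; see edta.log"
    else
        echo "no known failure signature found"
    fi
fi
```

A clean result only means none of these three patterns appeared; it is not proof that the run succeeded.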
A search on GitHub suggested the problem is related to LTR_retriever, so I specified the LTR_retriever path explicitly, but the same error appeared.
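Changing tack, I ran the program in Docker instead. This step is not resource-hungry: for a 400 Mb genome, 30 GB of RAM is plenty, and it finishes quickly even with only 12 threads.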
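https://quay.io/repository/biocontainers/edta?tab=tags # find the latest container tag and download it
# I ran this on Windows: after downloading the tar, load it as an image. The image looked small at only 2 GB, but the exported tar is fairly large, 7.12 GB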
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2024/8/6 17:08 7655989248 quay.io_biocontainers_edta_2.2.0--hdfd78af_1.tar
-a---- 2024/8/2 14:50 8215 新建 文本文档.txt
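As a quick sanity check on the size, the byte count in the listing can be converted to decimal GB and binary GiB. This is only a sketch; `bytes_to_gb` is a throwaway name of mine.

```shell
# Convert a byte count into decimal GB (10^9) and binary GiB (2^30).
bytes_to_gb() {
    awk -v b="$1" 'BEGIN { printf "%.2f GB, %.2f GiB\n", b / 1e9, b / 1073741824 }'
}

bytes_to_gb 7655989248   # prints: 7.66 GB, 7.13 GiB
```

So depending on the unit a tool reports, the same tar shows up as roughly 7.1 or 7.7 "G".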
docker load -i quay.io_biocontainers_edta_2.2.0--hdfd78af_1.tar
PS D:\bio_data\genome_make> docker run -v ${PWD}:/in -w /in quay.io/biocontainers/edta:2.2.0--hdfd78af_1 EDTA.pl --genome ./0712.400m.last.fasta -t 12
# runs normally
#########################################################
##### Extensive de-novo TE Annotator (EDTA) v2.2.0 #####
##### Shujun Ou (shujun.ou.1@gmail.com) #####
#########################################################
Parameters: --genome ./0712.400m.last.fasta -t 12
Tue Aug 6 09:24:29 UTC 2024 Dependency checking:
All passed!
Tue Aug 6 09:25:43 UTC 2024 Obtain raw TE libraries using various structure-based programs:
Tue Aug 6 09:25:43 UTC 2024 EDTA_raw: Check dependencies, prepare working directories.
Tue Aug 6 09:26:33 UTC 2024 Start to find LTR candidates.
Tue Aug 6 09:26:33 UTC 2024 Identify LTR retrotransposon candidates from scratch.
Tue Aug 6 09:34:51 UTC 2024 Finish finding LTR candidates.
Tue Aug 6 09:34:51 UTC 2024 Start to find SINE candidates.
Task Manager showed peak memory usage of about 20 GB, and the run proceeded normally. Not having to wrestle with conda feels great.
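For repeated runs, the docker invocation above could be wrapped in a small helper. This is a sketch under a few assumptions: the helper name `edta_docker_cmd` and its argument order are mine; `--rm` and `--memory` are standard `docker run` options that I added (they were not in my original command); the 30g cap just mirrors the "30 GB is plenty" estimate above; and it is written for a POSIX shell (for example WSL or Git Bash), not PowerShell. It only prints the command so it can be inspected first.

```shell
#!/bin/sh
# Image tag taken from the run above.
IMAGE="quay.io/biocontainers/edta:2.2.0--hdfd78af_1"

# Hypothetical helper: print the docker command for one genome.
# Arguments: host directory, genome file name inside it,
#            thread count (default 12), memory cap (default 30g).
# Assumes the directory and file names contain no spaces.
edta_docker_cmd() {
    host_dir=$1
    genome=$2
    threads=${3:-12}
    mem=${4:-30g}
    printf 'docker run --rm --memory %s -v %s:/in -w /in %s EDTA.pl --genome ./%s -t %s\n' \
        "$mem" "$host_dir" "$IMAGE" "$genome" "$threads"
}

# Print the command for inspection; pipe the output to sh to actually run it.
edta_docker_cmd "$PWD" 0712.400m.last.fasta 12 30g
```

Printing instead of executing keeps the helper harmless on machines without Docker, and makes it easy to compare against the command that is known to work.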