我有时候会对DIffBind的输出结果进行下修改,然后放到HOMER里面去找motif。但我有个不太好的一点就是,每次都不记我的awk操作,导致我每次都得重新输一遍awk。于是我今天在跑的时候就遇到了两个小bug。
第一个是HOMER的findMotifGenome.pl显示我有好多Redundant Peak IDs
findMotifsGenome.pl down.bed ~/reference/genome/TAIR10/Athaliana.fa .
Position file = down.bed
Genome = /home/sgd/reference/genome/TAIR10/Athaliana.fa
Output Directory = .
Using Custom Genome
Peak/BED file conversion summary:
BED/Header formatted lines: 1163
peakfile formatted lines: 0
Peak File Statistics:
Total Peaks: 1163
Redundant Peak IDs: 1162
Peaks lacking information: 0 (need at least 5 columns per peak)
Peaks with misformatted coordinates: 0 (should be integer)
Peaks with misformatted strand: 0 (should be either +/- or 0/1)
但事实上我awk操作产生的并没有冗余的peak
awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $1,$2,$3,$5,$12}' test.csv | sed 's/\"//g' | sort -k1,1 -k2,2n > down.bed
Chr1 67143 67743 * WT_Peak_5
Chr1 365682 366282 * WT_Peak_28
Chr1 415434 416034 * WT_Peak_30
Chr1 497934 498534 * WT_Peak_41
Chr1 677310 677910 * WT_Peak_58
我重新看了下HOMER的要求,才发现我的格式输出错误了……
findMotifsGenome.pl accepts HOMER peak files or BED files:
HOMER peak files should have at minimum 5 columns (separated by TABs, additional columns will be ignored):
Column1: Unique Peak ID
Column2: chromosome
Column3: starting position
Column4: ending position
Column5: Strand (+/- or 0/1, where 0="+", 1="-")
BED files should have at minimum 6 columns (separated by TABs, additional columns will be ignored)
Column1: chromosome
Column2: starting position
Column3: ending position
Column4: Unique Peak ID
Column5: not used
Column6: Strand (+/- or 0/1, where 0="+", 1="-")
In theory, HOMER will accept BED files with only 4 columns (+/- in the 4th column), and files without unique IDs, but this is NOT recommended. For one, if you don't have unique IDs for your regions, it's hard to go back and figure out which region contains which peak.
我输出的是bed格式,但我的第四列却是链信息而非Unique Peak ID。所以HOMER就会误把我的第四列变成了peakID,所以他就会认为我的peakID全都是冗余的。
在我改了列顺序之后,出现了另一个报错
awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $12,$1,$2,$3,$5}' test.csv | sed 's/\"//g' | sort -k2,2 -k3,3n > down.bed
WT_Peak_5 Chr1 67143 67743 *
WT_Peak_28 Chr1 365682 366282 *
WT_Peak_30 Chr1 415434 416034 *
WT_Peak_41 Chr1 497934 498534 *
WT_Peak_58 Chr1 677310 677910 *
WT_Peak_67 Chr1 898870 899470 *
类似这样的报错
Reading input files...
0 total sequences read
Autonormalization: 1-mers (4 total)
A inf% inf% -nan
C inf% inf% -nan
G inf% inf% -nan
T inf% inf% -nan
Autonormalization: 2-mers (16 total)
AA inf% inf% -nan
CA inf% inf% -nan
GA inf% inf% -nan
TA inf% inf% -nan
AC inf% inf% -nan
CC inf% inf% -nan
GC inf% inf% -nan
TC inf% inf% -nan
AG inf% inf% -nan
CG inf% inf% -nan
GG inf% inf% -nan
TG inf% inf% -nan
AT inf% inf% -nan
CT inf% inf% -nan
GT inf% inf% -nan
TT inf% inf% -nan
后来我才发现,我应该把链信息中的 *
写成 .
, 这样就不会报错了。所以最后的输入文件应该是长这样的
awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $12,$1,$2,$3,"."}' test.csv | sed 's/\"//g' | sort -k2,2 -k3,3n > down.bed
WT_Peak_5 Chr1 67143 67743 .
WT_Peak_28 Chr1 365682 366282 .
WT_Peak_30 Chr1 415434 416034 .
WT_Peak_41 Chr1 497934 498534 .