HOMER中findMotifGenome.pl中碰到的两个小错误

我有时候会对DIffBind的输出结果进行下修改，然后放到HOMER里面去找motif。但我有个不太好的一点就是，每次都不记我的awk操作，导致我每次都得重新输一遍awk。于是我今天在跑的时候就遇到了两个小bug。
第一个是HOMER的findMotifGenome.pl显示我有好多Redundant Peak IDs

findMotifsGenome.pl down.bed ~/reference/genome/TAIR10/Athaliana.fa .

    Position file = down.bed
    Genome = /home/sgd/reference/genome/TAIR10/Athaliana.fa
    Output Directory = .
    Using Custom Genome
    Peak/BED file conversion summary:
        BED/Header formatted lines: 1163
        peakfile formatted lines: 0

    Peak File Statistics:
        Total Peaks: 1163
        Redundant Peak IDs: 1162
        Peaks lacking information: 0 (need at least 5 columns per peak)
        Peaks with misformatted coordinates: 0 (should be integer)
        Peaks with misformatted strand: 0 (should be either +/- or 0/1)

但事实上我awk操作产生的并没有冗余的peak

awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $1,$2,$3,$5,$12}' test.csv | sed 's/\"//g' | sort -k1,1 -k2,2n > down.bed

Chr1    67143   67743   *       WT_Peak_5
Chr1    365682  366282  *       WT_Peak_28
Chr1    415434  416034  *       WT_Peak_30
Chr1    497934  498534  *       WT_Peak_41
Chr1    677310  677910  *       WT_Peak_58

我重新看了下HOMER的要求，才发现我的格式输出错误了……


findMotifsGenome.pl accepts HOMER peak files or BED files:

HOMER peak files should have at minimum 5 columns (separated by TABs, additional columns will be ignored):

    Column1: Unique Peak ID
    Column2: chromosome
    Column3: starting position
    Column4: ending position
    Column5: Strand (+/- or 0/1, where 0="+", 1="-")

BED files should have at minimum 6 columns (separated by TABs, additional columns will be ignored)

    Column1: chromosome
    Column2: starting position
    Column3: ending position
    Column4: Unique Peak ID
    Column5: not used
    Column6: Strand (+/- or 0/1, where 0="+", 1="-")

In theory, HOMER will accept BED files with only 4 columns (+/- in the 4th column), and files without unique IDs, but this is NOT recommended.  For one, if you don't have unique IDs for your regions, it's hard to go back and figure out which region contains which peak.

我输出的是bed格式，但我的第四列却是链信息而非Unique Peak ID。所以HOMER就会误把我的第四列变成了peakID，所以他就会认为我的peakID全都是冗余的。

在我改了列顺序之后，出现了另一个报错

awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $12,$1,$2,$3,$5}' test.csv | sed 's/\"//g' | sort -k2,2 -k3,3n > down.bed

WT_Peak_5   Chr1    67143   67743   *
WT_Peak_28  Chr1    365682  366282  *
WT_Peak_30  Chr1    415434  416034  *
WT_Peak_41  Chr1    497934  498534  *
WT_Peak_58  Chr1    677310  677910  *
WT_Peak_67  Chr1    898870  899470  *

类似这样的报错

Reading input files...
    0 total sequences read
    Autonormalization: 1-mers (4 total)
            A       inf%    inf%    -nan
            C       inf%    inf%    -nan
            G       inf%    inf%    -nan
            T       inf%    inf%    -nan
    Autonormalization: 2-mers (16 total)
            AA      inf%    inf%    -nan
            CA      inf%    inf%    -nan
            GA      inf%    inf%    -nan
            TA      inf%    inf%    -nan
            AC      inf%    inf%    -nan
            CC      inf%    inf%    -nan
            GC      inf%    inf%    -nan
            TC      inf%    inf%    -nan
            AG      inf%    inf%    -nan
            CG      inf%    inf%    -nan
            GG      inf%    inf%    -nan
            TG      inf%    inf%    -nan
            AT      inf%    inf%    -nan
            CT      inf%    inf%    -nan
            GT      inf%    inf%    -nan
            TT      inf%    inf%    -nan

后来我才发现，我应该把链信息中的 * 写成 .，这样就不会报错了。所以最后的输入文件应该是长这样的

awk -F "," 'BEGIN {OFS="\t"} $9 < -0.58 && $11 < 0.1 && NR > 1 {print $12,$1,$2,$3,"."}' test.csv | sed 's/\"//g' | sort -k2,2 -k3,3n > down.bed

WT_Peak_5   Chr1    67143   67743   .
WT_Peak_28  Chr1    365682  366282  .
WT_Peak_30  Chr1    415434  416034  .
WT_Peak_41  Chr1    497934  498534  .