个性化ChIPseeker注释peak结果，基因 or 功能区类型随心选，好用的脚本等你来取

ChIP-seq、ATAC-seq等获取到基因组范围内的peak后，需要对其进行基因注释以便进行后续的分析。ChIPseeker做为peak注释的R包，用来很不错，但有时候其结果还是满足不了个性化的需要。比如，只想关注某些特定类型的基因，如protein_coding、lincRNA、snRNA，这个时候就得自己想办法添加基因类型了，因为ChIPseeker注释结果没有这些信息。
正好本人遇到过这类的需求，那就分享一下自己的解决办法，让需要的朋友可以不必浪费时间便可轻松跨过这个小坑。下面用macs2的peak结果来说明：

chr1    3472147 3472924 sample1_peak_1  144     .       7.43822 16.84734        14.45499        236
chr1    3670777 3672287 sample1_peak_2  143     .       6.49641 16.75010        14.36088        485
chr1    3872843 3874430 sample1_peak_3  631     .       9.24537 66.46637        63.18048        630
chr1    4234288 4235914 sample1_peak_4  98      .       5.48185 12.13378        9.86325 1209
chr1    4496702 4497572 sample1_peak_5  87      .       5.65304 10.96738        8.72978 689
chr1    4590520 4591742 sample1_peak_6  242     .       6.89093 26.86273        24.25464        996
chr1    4608160 4611414 sample1_peak_7  275     .       3.18758 30.21178        27.53851        2034
chr1    4629533 4629848 sample1_peak_8  53      .       4.46293 7.47678 5.35324 81
chr1    4664981 4665332 sample1_peak_9  38      .       3.86787 5.89028 3.82913 183
chr1    4722306 4724297 sample1_peak_10 403     .       4.77592 43.30846        40.39729        1322

ChIPseeker默认情况下注释的结果：

  seqnames   start     end width strand             V4  V5 V6      V7       V8
1     chr1 3472148 3472924   777      * sample1_peak_1 144  . 7.43822 16.84734
2     chr1 3670778 3672287  1510      * sample1_peak_2 143  . 6.49641 16.75010
3     chr1 3872844 3874430  1587      * sample1_peak_3 631  . 9.24537 66.46637
4     chr1 4234289 4235914  1626      * sample1_peak_4  98  . 5.48185 12.13378
5     chr1 4496703 4497572   870      * sample1_peak_5  87  . 5.65304 10.96738
6     chr1 4590521 4591742  1222      * sample1_peak_6 242  . 6.89093 26.86273
        V9  V10
1 14.45499  236
2 14.36088  485
3 63.18048  630
4  9.86325 1209
5  8.72978  689
6 24.25464  996
                                                            annotation geneChr
1    Intron (ENSMUST00000161581.1/ENSMUSG00000089699.1, intron 1 of 1)       1
2                                                     Promoter (<=1kb)       1
3                                                    Distal Intergenic       1
4 Intron (ENSMUST00000208660.1/ENSMUSG00000025900.13, intron 12 of 29)       1
5                                                     Promoter (<=1kb)       1
6                                                    Distal Intergenic       1
  geneStart geneEnd geneLength geneStrand                geneId
1   3464977 3467285       2309          2  ENSMUSG00000103025.1
2   3214482 3671498     457017          2  ENSMUSG00000051951.5
3   3783876 3783933         58          2  ENSMUSG00000088333.2
4   4256234 4260519       4286          2  ENSMUSG00000102948.1
5   4491250 4496757       5508          2 ENSMUSG00000025902.13
6   4583129 4586252       3124          2  ENSMUSG00000104328.1
          transcriptId distanceToTSS
1 ENSMUST00000194099.1         -4863
2 ENSMUST00000070533.4             0
3 ENSMUST00000157708.2        -88911
4 ENSMUST00000195384.1         24605
5 ENSMUST00000195555.1             0
6 ENSMUST00000195771.1         -4269

在原有的基础追加了gene_name、gene_type两列信息：

seqnames   start     end width           name score stand  signal   pvalue
1     chr1 3472148 3472924   777 sample1_peak_1   144     . 7.43822 16.84734
2     chr1 3670778 3672287  1510 sample1_peak_2   143     . 6.49641 16.75010
3     chr1 3872844 3874430  1587 sample1_peak_3   631     . 9.24537 66.46637
4     chr1 4234289 4235914  1626 sample1_peak_4    98     . 5.48185 12.13378
5     chr1 4496703 4497572   870 sample1_peak_5    87     . 5.65304 10.96738
6     chr1 4590521 4591742  1222 sample1_peak_6   242     . 6.89093 26.86273
    qvalue summit        annotation geneChr geneStart geneEnd geneLength
1 14.45499    236            Intron    chr1   3464977 3467285       2309
2 14.36088    485  Promoter (<=1kb)    chr1   3214482 3671498     457017
3 63.18048    630 Distal Intergenic    chr1   3783876 3783933         58
4  9.86325   1209            Intron    chr1   4256234 4260519       4286
5  8.72978    689  Promoter (<=1kb)    chr1   4491250 4496757       5508
6 24.25464    996 Distal Intergenic    chr1   4583129 4586252       3124
  geneStrand                geneId         transcriptId distanceToTSS gene_name
1          2  ENSMUSG00000103025.1 ENSMUST00000194099.1         -4863   Gm37686
2          2  ENSMUSG00000051951.5 ENSMUST00000070533.4             0      Xkr4
3          2  ENSMUSG00000088333.2 ENSMUST00000157708.2        -88911   Gm27396
4          2  ENSMUSG00000102948.1 ENSMUST00000195384.1         24605    Gm6101
5          2 ENSMUSG00000025902.13 ENSMUST00000195555.1             0     Sox17
6          2  ENSMUSG00000104328.1 ENSMUST00000195771.1         -4269   Gm37323
             gene_type
1                  TEC
2       protein_coding
3                snRNA
4 processed_pseudogene
5       protein_coding
6              lincRNA

为了方便以后遇到类似需求，写了一个R脚本，可以用来做这个事，分析时只需在命令直接调用即可，方便快捷：

#!/usr/bin/Rscript
suppressPackageStartupMessages(library(optparse))

option_list = list(make_option(c("-i","--input"), help = "peak file, multiple are separated by commas."),
                   make_option(c("-e","--species"), help = "optional value is human or mouse, [default: %default]."),
                   make_option(c("-g","--gtf"), help = "genome gtf, [default: %default]."),
                   make_option(c("-l","--label"), help = "sample label, multiple are separated by commas, [default: %default]."),
                   make_option(c("-a","--atype"), help = "annotation type for analysis, multiple are separated by commas, [default: %default]."),
                   make_option(c("-f","--ftype"), help = "gene type for analysis, multiple are separated by commas, [default: %default]."),
                   make_option(c("-u","--up"), default = 3000, help = "length of tss upstream, [default: %default]."),
                   make_option(c("-d","--dn"), default = 3000, help = "length of tss downstream, [default: %default]."),
                   make_option(c("-c","--cname"), default='name,score,stand,signal,pvalue,qvalue,summit', help = "colnames from 3td column to end in peak file, [default: '%default']."),
                   make_option(c("-s","--save"), default = FALSE, action="store_true", help = "whether to save annotation txt, [default: %default]."),
                   make_option(c("-p","--prefix"), default = 'chipseeker', help = "the prefix for output, [default: '%default']."),
                   make_option(c("-o","--outdir"), default = '.', help = "output directory, [default: '%default']."))
opt_parser <- OptionParser(option_list = option_list,
                           description='\n\tThis script uesed to annotate peak using gtf by ChIPseeker, default format is for macs2 result.
                                        \ntxdb: TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Mmusculus.UCSC.mm10.knownGene.
                                        \nannotation type: Promoter, Exon, Intron, UTR, Downstream, Intergenic.
                                        \ngene type: feature in gtf, protein_coding, snRNA, et al.')
args <- parse_args(opt_parser)

if(is.null(args$input) & (is.null(args$species) | is.null(args$gtf))){
    print_help(opt_parser)
    stop("argument input and species or gtf are required.")
}

suppressPackageStartupMessages(library(ChIPseeker))
suppressPackageStartupMessages(library(ggplot2))

if(!dir.exists(args$outdir)){
   dir.create(args$outdir)
}

plotpie <- function(data){
   if(class(data)=='csAnno'){
      pdata <- data@annoStat
   }else{
      pdata <- data
   }
   pdata$ratio <- round(pdata$Frequency, 2)
   pdata$label <- paste0(pdata$Feature, ' (', pdata$ratio,'%)')
   col <- ChIPseeker:::getCols(nrow(pdata))
   p <- ggplot(pdata,aes(x=1,ratio,fill=factor(label, levels=label))) +
          geom_bar(stat='identity', color="black", show.legend=T) +
          labs(fill="") +
          coord_polar("y", start=0) +
          scale_y_reverse() +
          theme_void() +
          scale_fill_manual(values=col) +
          theme(legend.spacing.y=unit(0.1,'cm'), legend.box.margin=margin(1,10,1,-12), legend.key.size=unit(0.15,'inches'), legend.text=element_text(size=8)) +
          guides(fill = guide_legend(byrow = T))
   return(p)
}

anno <- function(input,species,gtf,label,atype,ftype,up,dn,cname,save,prefix,outdir){
   if(!is.null(gtf) & !is.null(species)){
      print('both arguments gtf and species are provided, will use gtf to annotate.')
   }

   if(!is.null(gtf)){
      txdb <- GenomicFeatures::makeTxDbFromGFF(gtf)
      egdb <- NULL
      gtftab <- as.data.frame(rtracklayer::import(gtf))
      gtftab <- unique(gtftab[gtftab$type == 'transcript', c('transcript_id', 'gene_name', 'gene_type')])
   }else if(!is.null(species)){
       if(species == 'human'){
         suppressPackageStartupMessages(library(TxDb.Hsapiens.UCSC.hg38.knownGene))
         txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
         egdb <- 'org.Hs.eg.db'
      }else if(species == 'mouse'){
         suppressPackageStartupMessages(library(TxDb.Mmusculus.UCSC.mm10.knownGene))
         txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
         egdb <- 'org.Mm.eg.db'
      }else{
         stop("argument species is human or mouse.")
      }
   }else{
      stop("please providing argument species or gtf.")
   }

   files <- strsplit(input, split='\\,')[[1]]

   if(!is.null(label)){
      labs <- strsplit(label, split='\\,')[[1]]
   }else{
      labs <- sapply(basename(files), tools::file_path_sans_ext)
   }

  alist <- list()
  for(idx in 1:length(files)){
     peak <- readPeakFile(files[idx])
     peakanno <- annotatePeak(peak, tssRegion=c(-up, dn), TxDb=txdb, annoDb=egdb)

     if(!is.null(atype)){
        atype <- strsplit(atype, split='\\,')[[1]]
        peakanno <- subset(peakanno, annotation %in% grep(paste0(atype, collapse='|'), annotation, value=T))
     }

     if(!is.null(ftype) & !is.null(gtf)){
        types <- strsplit(ftype, split='\\,')[[1]]
        genes <- gtftab[gtftab$gene_type %in% types, 'transcript_id']
        peakanno <- subset(peakanno, transcriptId %in% genes)
     }else if(!is.null(ftype) & is.null(gtf)){
        print("warning: argument ftype takes effect only when gtf is provided.")
     }

     if(save){
        out <- as.data.frame(peakanno)
        ord <- out[, 6]
        if(length(grep('^V',colnames(out))) != 0){
           out <- base::subset(out, select=-strand)
           colnames(out)[grep('^V',colnames(out))] <- strsplit(cname, split='\\,')[[1]]
        }
        out$geneChr <- out$seqnames
        out$annotation <- sapply(out$annotation,function(x){ifelse(grepl('^Intron|^Exon', x), strsplit(x,split=' ')[[1]][1], x)})

        if(!is.null(gtf)){
           out$transcript_id <- out$transcriptId
           out <- merge(out, gtftab, all.x=T)
           out <- out[-1]
           out <- out[match(ord, out[,5]),]
        }else{
           out$geneId <- ifelse(is.na(out$ENSEMBL),out$geneId,out$ENSEMBL)
           out <- base::subset(out, select=-c(ENSEMBL, GENENAME))
        }
        write.table(out,paste0(outdir,'/',prefix,'.',labs[idx],'.anno.txt'),sep='\t',quote=F,row.names=F)
     }

     alist[[labs[idx]]] <- peakanno
  }

  if(length(alist)>1){
     p <- plotAnnoBar(alist)
     ggsave(paste0(outdir,'/',prefix,'.anno.pdf'), p, width=6, height = 2.5 + length(alist)/2)
  }else{
     #plotAnnoPie(alist[[1]])
     p <- plotpie(alist[[1]])
     ggsave(paste0(outdir,'/',prefix,'.',labs[1],'.anno.pdf'), p, width=5, height=3)
  }
}

anno(args$input,args$species,args$gtf,args$label,args$atype,args$ftype,args$up,args$dn,args$cname,args$save,args$prefix,args$outdir)

为了保证脚本可以正常使用，需要保证环境里面已安装optparse、ChIPseeker、ggplot2、rtracklayer、GenomicFeatures。如果不提供gtf，还需要TxDb.Hsapiens.UCSC.hg38.knownGene、TxDb.Mmusculus.UCSC.mm10.knownGene、org.Hs.eg.db、org.Mm.eg.db数据库包。
由于经常遇到人和小鼠，所以这里--species参数只设置了这两个物种，其他物种为方便分析可以选择使用gtf模式。脚本的使用说明和参数如下：

Rscript get_annopeak_chipseeker.r --help
Usage: get_annopeak_chipseeker.r [options]

        This script uesed to annotate peak using gtf by ChIPseeker, default format is for macs2 result.

txdb: TxDb.Hsapiens.UCSC.hg38.knownGene, TxDb.Mmusculus.UCSC.mm10.knownGene.

annotation type: Promoter, Exon, Intron, UTR, Downstream, Intergenic.

gene type: feature in gtf, protein_coding, snRNA, et al.

Options:
        -i INPUT, --input=INPUT
                peak file, multiple are separated by commas.

        -e SPECIES, --species=SPECIES
                optional value is human or mouse, [default: NULL].

        -g GTF, --gtf=GTF
                genome gtf, [default: NULL].

        -l LABEL, --label=LABEL
                sample label, multiple are separated by commas, [default: NULL].

        -a ATYPE, --atype=ATYPE
                annotation type for analysis, multiple are separated by commas, [default: NULL].

        -f FTYPE, --ftype=FTYPE
                gene type for analysis, multiple are separated by commas, [default: NULL].

        -u UP, --up=UP
                length of tss upstream, [default: 3000].
        -d DN, --dn=DN
                length of tss downstream, [default: 3000].

        -c CNAME, --cname=CNAME
                colnames from 3td column to end in peak file, [default: 'name,score,stand,signal,pvalue,qvalue,summit'].

        -s, --save
                whether to save annotation txt, [default: FALSE].

        -p PREFIX, --prefix=PREFIX
                the prefix for output, [default: 'chipseeker'].

        -o OUTDIR, --outdir=OUTDIR
                output directory, [default: '.'].

        -h, --help
                Show this help message and exit

大部分参数已经通过蹩脚的英语做了解释，看上去应该可以理解其代表的含义，这里说一下使用时应该注意的地方：--species (ChIPseeker默认注释结果)和--gtf两个参数二选一即可，如果需要基因类型注释必须选择--gtf；--label参数可以给多样本提供样本名，顺序需要跟--input保持一致；默认只保存图片，要保留注释文件的话记得使用--save参数；单个样本运行时注释结果以饼图展示，多个样本一起运行时用条形图展示。
另外，饼图并没有直接ChIPseeker里面的画图函数，想要了解详情的可以戳这里[ChIPseeker绘图函数借用]。下面来演示一下脚本的使用：

Rscript  get_annopeak_chipseeker.r -i sample1_peaks.narrowPeak -g gencode.vM25.annotation.gtf -l sample1 -s

上面命令执行结束会在当前目录生成两个文件chipseeker.sample1.anno.txt、chipseeker.sample1.anno.pdf，分别为注释结果和饼图。

多个样本一起运行时，并选择自己想关注的基因类型和功能区间，生成条形图方便比较多个样本的结果：

Rscript get_annopeak_chipseeker.r -i sample1_peaks.narrowPeak,sample2_peaks.narrowPeak -g gencode.vM25.annotation.gtf -l sample1,sample2 -s -p chipseeker.type -a Promoter,Exon,Intron -f protein_coding,lincRNA,snRNA

上面的命令执行结束会给每个样本生成一个.anno.txt注释结果和一个汇总所有样本的条形图。

往期回顾

linux | while + read 实现本地 or 集群批处理，实用且优雅
 pyscenic | 单细胞转录因子分析，原理图文详解
 一网打尽scRNA矩阵格式读取和转化(h5 h5ad loom)
ggplot2 | 开发自己的画图函数
 R包安装的4种姿势

最后编辑于：2023.09.23 17:04:29

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 194,491评论 5赞 459
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 81,856评论 2赞 371
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 141,745评论 0赞 319
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 52,196评论 1赞 263
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 61,073评论 4赞 355
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 46,112评论 1赞 272
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 36,531评论 3赞 381
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 35,215评论 0赞 253
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 39,485评论 1赞 290
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 34,578评论 2赞 309
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 36,356评论 1赞 326
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 32,215评论 3赞 312
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 37,583评论 3赞 299
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 28,898评论 0赞 17
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 30,174评论 1赞 250
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 41,497评论 2赞 341
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 40,697评论 2赞 335

个性化ChIPseeker注释peak结果，基因 or 功能区类型随心选，好用的脚本等你来取

往期回顾

推荐阅读更多精彩内容