Bedtools笔记

BedTools 笔记

工具目的:探索、处理和操作基因间隔文件(e.g., BED, VCF, BAM)。

学习Tutorial

为了学习,先创建工作目录,下载相应的demo文件。

intersect

如果使用默认,intersect会报道两个文件overlap区间。(可能已经与源文件不同啦)

The -wa (write A) and -wb (write B) options allow one to see the original records from the A and B files that overlapped. 如果是用该参数,会返回产生overlap的原AB序列。

The -wo (write overlap) option allows one to also report the number of base pairs of overlap between the features that overlap between each of the files. 同时返回交叉了的碱基对的数量。

We can also count, for each feature in the “A” file, the number of overlapping features in the “B” file. This is handled with the -c option. 相同的Afile特征会对应几个Bfile特征。

Often we want to identify those features in our A file that do not overlap features in the B file. The -v option is your friend in this case.A中区间没有匹配到B中区间返回。

Recall that the default is to report overlaps between features in A and B so long as at least one basepair of overlap exists. However, the -f option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported.设定阈值,需要多少才能算交叉了。可以取50% -f 0.50

排序的数据可以进行更快速的分析。So far the examples presented have used the traditional algorithm in bedtools for finding intersections. It turns out, however, that bedtools is much faster when using presorted data.NOTE: While the run times in this example are quite small, the performance gains from using the -sorted option groqw as datasets grow larger. For example, compare the runtimes of the sorted and unsorted approaches as a function of dataset size in the figure below. The important thing to remember is that each dataset must be sorted by chromosome and then by start position: sort -k1,1 -k2,2n.-

As of version 2.21.0, bedtools is able to intersect an “A” file against one or more “B” files. This greatly simplifies analyses involving multiple datasets relevant to a given experiment. For example, let’s intersect exons with CpG islands, GWAS SNPs, an the ChromHMM annotations.支持单个数据集与多个数据集的比较分析。

Now by default, this isn’t incredibly informative as we can’t tell which of the three “B” files yielded the intersection with each exon. However, if we use the -wa and wb options, we
can see from which file number (following the order of the files given on the command line) the intersection came. In this case, the 7th column reflects this file number.

Additionally, one can use file “labels” instead of file numbers to facilitate interpretation, especially when there are many files involved. 用-names可以进行标记

bedtools intersect -a exons.bed -b cpg.bed gwas.bed hesc.chromHmm.bed -sorted -wa -wb -names cpg gwas chromhmm \
  | head -10000 \
  | tail -10

merge

Many datasets of genomic features have many individual features that overlap one another (e.g. aligments from a ChiP seq experiment). It is often useful to just cobine the overlapping into a single, contiguous interval. The bedtools merge command will do this for you. 有点像取并集。

The merge tool requires that the input file is sorted by chromosome, then by start position. This allows the merging algorithm to work very quickly without requiring any RAM.数据集必须先排序。

If your files are unsorted, the merge tool will raise an error. To correct this, you need to sort your BED using the UNIX sort utility. For example:

sort -k1,1 -k2,2n foo.bed > foo.sort.bed
Merging results in a new set of intervals representing the merged set of intervals in the input. That is, if a base pair in the genome is covered by 10 features, it will now only be represented once in the output file.

bedtools merge -i exons.bed | head -n 20
chr1    11873   12227
chr1    12612   12721
chr1    13220   14829
chr1    14969   15038
chr1    15795   15947
chr1    16606   16765
chr1    16857   17055
chr1    17232   17368
chr1    17605   17742
chr1    17914   18061
chr1    18267   18366
chr1    24737   24891
chr1    29320   29370
chr1    34610   35174
chr1    35276   35481
chr1    35720   36081
chr1    69090   70008
chr1    134772  139696
chr1    139789  139847
chr1    140074  140566

一个碱基出现多次,只显示一次。

A more sophisticated approach would be to not only merge overlapping intervals, but also report the number of intervals that were integrated into the new, merged interval. One does this with the -c and -o options. The -c option allows one to specify a column or columns in the input that you wish to summarize. The -o option defines the operation(s) that you wish to apply to each column listed for the -c option. For example, to count the number of overlapping intervals that led to each of the new “merged” intervals, one will “count” the first column (though the second, third, fourth, etc. would work just fine as well).

bedtools merge -i exons.bed -c 1 -o count | head -n 20
chr1    11873   12227   1
chr1    12612   12721   1
chr1    13220   14829   2
chr1    14969   15038   1
chr1    15795   15947   1

这个可以计算重叠区间的个数。

With the -d (distance) option, one can also merge intervals that do not overlap, yet are close to one another. 设定一个阈值,在其内时,可以将两段分隔序列连在一起(merge)。

Many times you want to keep track of the details of exactly which intervals were merged. One way to do this is to create a list of the names of each feature. We can do with with the collapse operation available via the -o argument. The name of the exon is in the fourth column, so we ask merge to create a list of the exon names with -c 4 -o collapse:

bedtools merge -i exons.bed -d 90 -c 1,4 -o count,collapse | head -20
chr1    11873   12227   1   NR_046018_exon_0_0_chr1_11874_f
chr1    12612   12721   1   NR_046018_exon_1_0_chr1_12613_f
chr1    13220   14829   2   NR_046018_exon_2_0_chr1_13221_f,NR_024540_exon_0_0_chr1_14362_r
chr1    14969   15038   1   NR_024540_exon_1_0_chr1_14970_r
chr1    15795   15947   1   NR_024540_exon_2_0_chr1_15796_r
chr1    16606   16765   1   NR_024540_exon_3_0_chr1_16607_r

这个很容易理解,把重叠的标签都显示出来了。

Complement

We often want to know which intervals of the genome are NOT “covered” by intervals in a given feature file. For example, if you have a set of ChIP-seq peaks, you may also want to know which regions of the genome are not bound by the factor you assayed. The complement addresses this task. 在你的序列中而不在参考序列中的区间。

bedtools complement -i exons.bed -g genome.txt 

genomecov

For many analyses, one wants to measure the genome wide coverage of a feature file. For example, we often want to know what fraction of the genome is covered by 1 feature, 2 features, 3 features, etc. This is frequently crucial when assessing the “uniformity” of coverage from whole-genome sequencing. This is done with the versatile genomecov tool.

As an example, let’s produce a histogram of coverage of the exons throughout the genome. Like the merge tool, genomecov requires pre-sorted data. It also needs a genome file as above. 略略有点复杂,参数也可以设置很多,看原手册吧,有图解释。

bedtools genomecov -i exons.bed -g genome.txt

This should run for 3 minutes or so. At the end of your output, you should see something like:

genome  0   3062406951  3137161264  0.976171
genome  1   44120515    3137161264  0.0140638
genome  2   15076446    3137161264  0.00480576
genome  3   7294047 3137161264  0.00232505
genome  4   3650324 3137161264  0.00116358
genome  5   1926397 3137161264  0.000614057
genome  6   1182623 3137161264  0.000376972
genome  7   574102  3137161264  0.000183
genome  8   353352  3137161264  0.000112634

BEDGRAPH output

Using the -bg option, one can also produce BEDGRAPH output which represents the “depth” fo feature coverage for each base pair in the genome:

bedtools genomecov -i exons.bed -g genome.txt -bg | head -20
chr1    11873   12227   1
chr1    12612   12721   1
chr1    13220   14361   1
chr1    14361   14409   2
chr1    14409   14829   1
chr1    14969   15038   1
chr1    15795   15947   1

Sophistication through chaining multiple bedtools

Analytical power in bedtools comes from the ability to “chain” together multiple tools in order to construct rather sophisicated analyses with very little programming - you just need genome arithmetic! Have a look at the examples here.

Here are a few more examples.

  1. Identify the portions of intended capture intervals that did not have any coverage:

@brent_p bedtools genomecov -ibam aln.bam -bga | awk ‘$4==0’ | | bedtools intersect -a regions -b - > foo

— Aaron Quinlan (

@aaronquinlan

)

January 10, 2014

  1. Assessing the breadth and depth coverage of sequencing coverage in exome studies.

Principal component analysis

We will use the bedtools implementation of a Jaccard statistic to meaure the similarity of two datasets. Briefly, the Jaccard statistic measures the ratio of the number of intersecting base pairs to the total number of base pairs in the two sets. As such, the score ranges from 0.0 to 1. 0; lower values reflect lower similarity, whereas higher values reflect higher similarity.

用一个叫做Jaccard统计的方法取计算两个数据集的相似度。

Let’s walk through an example: we would expect the Dnase hypersensivity sites to be rather similar between two samples of the same fetal tissue type. Let’s test:

bedtools jaccard \
    -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
    -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bed
intersection    union-intersection  jaccard n_intersections
81269248    160493950   0.50637 130852

But what about the similarity of two different tissue types?

bedtools jaccard \
    -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \
    -b fSkin_fibro_bicep_R-DS19745.hg19.hotspot.twopass.fdr0.05.merge.bed
intersection    union-intersection  jaccard n_intersections
28076951    164197278   0.170995    73261

Hopefully this demonstrates how the Jaccard statistic can be used as a simple statistic to reduce the dimensionality of the comparison between two large (e.g., often containing thousands or millions of intervals) feature sets.

A Jaccard statistic for all 400 pairwise comparisons.

We are going to take this a bit further and use the Jaccard statistic to measure the similarity of all 20 tissue samples against all other 20 samples. Once we have a 20x20 matrix of similarities, we can use dimensionality reduction techniques such as hierarchical clustering or principal component analysis to detect higher order similarities among all of the datasets.

We will use GNU parallel to compute a Jaccard statistic for the 400 (20*20) pairwise comparisons among the fetal tissue samples. 这里进行多数据集的非独立比较,涉及到并行计算。

But first, we need to install GNU parallel.

brew install parallel

Next, we need to install a tiny script I wrote for this analysis.

curl -O https://s3.amazonaws.com/bedtools-tutorials/web/make-matrix.py

Now, we can use parallel to, you guessed it, compute the 400 pairwise Jaccard statistics in parallel using as many processors as you have available.

parallel "bedtools jaccard -a {1} -b {2} \
         | awk 'NR>1' \
         | cut -f 3 \
         > {1}.{2}.jaccard" \
         ::: `ls *.merge.bed` ::: `ls *.merge.bed`

This command will create a single file containing the pairwise Jaccard measurements from all 400 tests.

find . \
    | grep jaccard \
    | xargs grep "" \
    | sed -e s"/\.\///" \
    | perl -pi -e "s/.bed./.bed\t/" \
    | perl -pi -e "s/.jaccard:/\t/" \
    > pairwise.dnase.txt

A bit of cleanup to use more intelligible names for each of the samples.

cat pairwise.dnase.txt \
| sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \
| sed -e 's/.hg19//g' \
> pairwise.dnase.shortnames.txt

Now let’s make a 20x20 matrix of the Jaccard statistic. This will allow the data to play nicely with R.

awk 'NF==3' pairwise.dnase.shortnames.txt \
| awk '$1 ~ /^f/ && $2 ~ /^f/' \
| python make-matrix.py \
> dnase.shortnames.distance.matrix

Let’s also make a file of labels for each dataset so that we can label each dataset in our R plot.

cut -f 1 dnase.shortnames.distance.matrix | cut -f 1 -d "-" | cut -f 1 -d "_" > labels.txt

Now start up R.

R

NOTE: The following example assumes that you have both the ggplot2 and RColorBrewer packages installed on your computer. If they are not installed, run both install.packages("ggplot2") and install.packages("RColorBrewer") from the R prompt and respond to the prompts that will follow.-

You should see something very similar to this:

R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-apple-darwin12.0.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

No paste these commands into the R console:

library(ggplot2)
library(RColorBrewer)
blues <- colorRampPalette(c('dark blue', 'light blue'))
greens <- colorRampPalette(c('dark green', 'light green'))
reds <- colorRampPalette(c('pink', 'dark red'))
 
x <- read.table('dnase.shortnames.distance.matrix')
labels <- read.table('labels.txt')
ngroups <- length(unique(labels))
pca <- princomp(x)

pdf('pca.pdf')
qplot(pca$scores[,1], pca$scores[,2], color=labels[,1],     geom="point", size=1) +
  scale_color_manual(values = c(blues(4), greens(5), reds(5))) 
dev.off()

You should see this:

img
img

Et voila.

Note that PCA was used in this case as a toy example of what PCA does for the CSHL Adv. Seq. course. Heatmaps are a more informative visualization in this case since Jaccard inherently returns a measure of distance.

So let’s make a heatmap for giggles.

NOTE: The following example assumes that you have both the gplots package installed on your computer. If it are not installed, run install.packages("gplots") from the R prompt and respond to the prompts that will follow.-

library(gplots)
library(RColorBrewer)
jaccard_table <- x[, -1]
jaccard_matrix <- as.matrix(jaccard_table)

pdf('heat.pdf')
heatmap.2(jaccard_matrix, col = brewer.pal(9,"Blues"), margins = c(14, 14), density.info = "none", lhei=c(2, 8), trace= "none")
dev.off()

You should see this:

img
img

Puzzles to help teach you more bedtools.

  1. Create a BED file representing all of the intervals in the genome that are NOT exonic.
  2. What is the average distance from GWAS SNPs to the closest exon? (Hint - have a look at the closest tool.)
  3. Count how many exons occur in each 500kb interval (“window”) in the human genome. (Hint - have a look at the makewindows tool.)
  4. Are there any exons that are completely overlapped by an enhancer? If so, how many?
  5. What fraction of the GWAS SNPs are exonic?
  6. What fraction of the GWAS SNPs are lie in either enhancers or promoters in the hESC data we have?
  7. Create intervals representing the canonical 2bp splice sites on either side of each exon (don’t worry about excluding splice sites at the first or last exon). (Hint - have a look at the flank tool.)
  8. What is the Jaccard statistic between CpG and hESC enhancers? Compare that to the Jaccard statistic between CpG and hESC promoters. Does the result make sense? (Hint - you will need grep).
  9. What would you expect the Jaccard statistic to look like if promoters were randomly distributed throughout the genome? (Hint - you will need the shuffle tool.)
  10. Which hESC ChromHMM state (e.g., 11_Weak_Txn, 10_Txn_Elongation) represents the most number of base pairs in the genome? (Hint: you will need to use awk or perl here, as well as the groupby tool.)

answers

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 206,723评论 6 481
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 88,485评论 2 382
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 152,998评论 0 344
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 55,323评论 1 279
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 64,355评论 5 374
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,079评论 1 285
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,389评论 3 400
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,019评论 0 259
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 43,519评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,971评论 2 325
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,100评论 1 333
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,738评论 4 324
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,293评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,289评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,517评论 1 262
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,547评论 2 354
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,834评论 2 345

推荐阅读更多精彩内容

  • 成功不是随机事件,成功是一系列可预知的、强而有力的优势环境和机遇构成!
    严老帅阅读 165评论 0 0
  • 文/海芬 垄上野菊香 银杏叶儿黄 冬寒步履欢 健身心舒畅 迎着冬日阳光 徒步乡村道上 银杏野菊金灿 阳光暖照心房
    海语天籁阅读 462评论 1 8
  • 青石板上的月光照进这山城 泥泞路上的大雨浇灌这情绪 唱一唱 那夜星空灿烂 写一写 那时满怀热爱 念一念 这刻想你甚...
    CX裂缝中的阳光阅读 158评论 0 0