0.数据介绍
这是一篇发表在Cancer Cell 2019 Sep的文章——Single-Cell Transcriptomics in Medulloblastoma Reveals Tumor-Initiating Progenitors and Oncogenic Cascades during Tumorigenesis and Relapse(PMID: 31474569)。本文揭示了OLIG2+细胞在髓母细胞瘤的发生发展起到重要作用,并且会以静息状态,存在于成熟的髓母细胞瘤中,并在复发时被激活,可能是髓母细胞瘤对治疗不敏感的原因之一。OLIG2能够通过表观调控,激活hippo-YAP和AUTOTA-A/MYCN的网络促进MB的生长。
在曾老师推送的基础上,想进一步对文章的数据进行复现。常规转录差异建议都加上一个转录因子数据 (qq.com)
本数据集包含了ChIP-seq,ATAC-seq,RNA-seq。是非常好的练手机会。
背景知识:关于Medulloblastoma (MB):主要是有4个分子亚型:Wingless (WNT), Sonic hedgehog (SHH), Group 3, and Group 4。其中,SHH亚型的信号通路激活主要缘于PTCH1的功能缺失以及SMO的激活,其中GNAS是一种抑癌基因,其功能缺陷与不良预后相关
1.数据预处理
现在SRA网站中获取实验设计的情况。
Run BioProject BioSample Assay Type AvgSpotLen Bases Bytes Experiment GEO_Accession LibraryLayout LibrarySelection LibrarySource Sample Name source_name SRA Study Age geneotype Antibody tissue
1 SRR7984671 PRJNA495715 SAMN10220650 ATAC-seq 75 1.09 G 535.44 Mb SRX4815869 GSM3423042 SINGLE other GENOMIC GSM3423042 medulloblastoma tumor cells SRP165176 Postnatal day 20 hGFAPcre;Pthc/c
2 SRR7984672 PRJNA495715 SAMN10220649 ATAC-seq 75 569.80 M 286.09 Mb SRX4815870 GSM3423043 SINGLE other GENOMIC GSM3423043 medulloblastoma tumor cells SRP165176 Postnatal day 20 hGFAPcre;Pthc/c;Olig2-GFP
3 SRR7984673 PRJNA495714 SAMN10220667 ChIP-Seq 75 1.10 G 665.72 Mb SRX4815871 GSM3423044 SINGLE ChIP GENOMIC GSM3423044 Cerebellum SRP165177 Olig2 (Millipore, ab9610)
4 SRR7984674 PRJNA495714 SAMN10220666 ChIP-Seq 75 2.97 G 1.32 Gb SRX4815872 GSM3423045 SINGLE ChIP GENOMIC GSM3423045 Medulloblastoma tissue SRP165177 Olig2 (Millipore, ab9610)
5 SRR7984675 PRJNA495714 SAMN10220665 ChIP-Seq 75 1.47 G 753.94 Mb SRX4815873 GSM3423046 SINGLE ChIP GENOMIC GSM3423046 Medulloblastoma tissue SRP165177 H3k27ac (Abcam, ab4729)
6 SRR7984676 PRJNA495713 SAMN10220664 RNA-Seq 81 9.18 G 4.98 Gb SRX4815874 GSM3423057 PAIRED cDNA TRANSCRIPTOMIC GSM3423057 medullablastoma tumor cells SRP165179 Postnatal day 24 to30 hGFAPcre;Ptchc/c
7 SRR7984677 PRJNA495712 SAMN10220663 RNA-Seq 150 5.46 G 1.96 Gb SRX4815875 GSM3423058 PAIRED cDNA TRANSCRIPTOMIC GSM3423058 cerebellum SRP165178 Postnatal day 18 Lats1/2f/f
8 SRR7984678 PRJNA495712 SAMN10220662 RNA-Seq 150 4.95 G 1.79 Gb SRX4815876 GSM3423059 PAIRED cDNA TRANSCRIPTOMIC GSM3423059 cerebellum SRP165178 Postnatal day 18 Lats1/2f/f
9 SRR7984679 PRJNA495712 SAMN10220661 RNA-Seq 150 4.82 G 1.77 Gb SRX4815877 GSM3423060 PAIRED cDNA TRANSCRIPTOMIC GSM3423060 cerebellum SRP165178 Postnatal day 18 Lats1/2f/f
10 SRR7984680 PRJNA495712 SAMN10220660 RNA-Seq 150 4.38 G 1.59 Gb SRX4815878 GSM3423061 PAIRED cDNA TRANSCRIPTOMIC GSM3423061 Medullobalstoma SRP165178 Postnatal day 18 Lats1/2f/f; Math1cre
11 SRR7984681 PRJNA495712 SAMN10220646 RNA-Seq 150 4.17 G 1.50 Gb SRX4815879 GSM3423062 PAIRED cDNA TRANSCRIPTOMIC GSM3423062 Medullobalstoma SRP165178 Postnatal day 18 Lats1/2f/f; Math1cre
12 SRR7984682 PRJNA495712 SAMN10220645 RNA-Seq 150 5.06 G 1.84 Gb SRX4815880 GSM3423063 PAIRED cDNA TRANSCRIPTOMIC GSM3423063 Medullobalstoma SRP165178 Postnatal day 18 Lats1/2f/f; Math1cre
13 SRR7984683 PRJNA495712 SAMN10220644 RNA-Seq 76 2.24 G 720.74 Mb SRX4815881 GSM3423064 SINGLE cDNA TRANSCRIPTOMIC GSM3423064 Medullobalstoma SRP165178 Postnatal day 20 Ptchf/f; hGFAPcre
14 SRR7984684 PRJNA495712 SAMN10220643 RNA-Seq 76 2.17 G 700.20 Mb SRX4815882 GSM3423065 SINGLE cDNA TRANSCRIPTOMIC GSM3423065 Medullobalstoma SRP165178 Postnatal day 20 Ptchf/f; hGFAPcre
15 SRR7984685 PRJNA495712 SAMN10220642 RNA-Seq 76 2.16 G 692.55 Mb SRX4815883 GSM3423066 SINGLE cDNA TRANSCRIPTOMIC GSM3423066 Medullobalstoma SRP165178 Postnatal day 20 Ptchf/f; Olig2c/c; hGFAPcre
16 SRR7984686 PRJNA495712 SAMN10220641 RNA-Seq 76 2.21 G 710.51 Mb SRX4815884 GSM3423067 SINGLE cDNA TRANSCRIPTOMIC GSM3423067 Medullobalstoma SRP165178 Postnatal day 20 Ptchf/f; Olig2c/c; hGFAPcre
17 SRR8037270 PRJNA495592 SAMN10228087 RNA-Seq 132 3.16 G 1.87 Gb SRX4867885 GSM3423041 PAIRED cDNA TRANSCRIPTOMIC GSM3423041 medullablastoma tumor cells SRP165234 Postnatal day 10 hGFAPcre;Ptchc/c
18 SRR8037271 PRJNA495592 SAMN10228087 RNA-Seq 132 2.85 G 1.69 Gb SRX4867885 GSM3423041 PAIRED cDNA TRANSCRIPTOMIC GSM3423041 medullablastoma tumor cells SRP165234 Postnatal day 10 hGFAPcre;Ptchc/c
19 SRR8037272 PRJNA495592 SAMN10228087 RNA-Seq 132 4.23 G 2.48 Gb SRX4867885 GSM3423041 PAIRED cDNA TRANSCRIPTOMIC GSM3423041 medullablastoma tumor cells SRP165234 Postnatal day 10 hGFAPcre;Ptchc/c
20 SRR8037273 PRJNA495592 SAMN10228087 RNA-Seq 132 2.80 G 1.67 Gb SRX4867885 GSM3423041 PAIRED cDNA TRANSCRIPTOMIC GSM3423041 medullablastoma tumor cells SRP165234 Postnatal day 10 hGFAPcre;Ptchc/c
21 SRR9667717 PRJNA495714 SAMN12249854 ChIP-Seq 75 2.34 G 1.08 Gb SRX6428312 GSM3936983 SINGLE ChIP GENOMIC GSM3936983 Cerebellum SRP165177 Olig2 (Millipore, ab9610)
22 SRR9667718 PRJNA495714 SAMN12249853 ChIP-Seq 75 1.98 G 896.85 Mb SRX6428313 GSM3936984 SINGLE ChIP GENOMIC GSM3936984 Medulloblastoma tissue SRP165177 Olig2 (Millipore, ab9610)
23 SRR9667719 PRJNA495714 SAMN12249852 ChIP-Seq 75 1.82 G 838.77 Mb SRX6428308 GSM3936985 SINGLE ChIP GENOMIC GSM3936985 Medulloblastoma tissue SRP165177 Olig2 (Millipore, ab9610)
24 SRR9667720 PRJNA495714 SAMN12249850 ChIP-Seq 75 2.02 G 924.48 Mb SRX6428309 GSM3936986 SINGLE ChIP GENOMIC GSM3936986 Medulloblastoma tissue SRP165177 H3k27ac (Abcam, ab4729)
25 SRR9667721 PRJNA495714 SAMN12249849 ChIP-Seq 75 1.10 G 665.72 Mb SRX6428310 GSM3936987 SINGLE ChIP GENOMIC GSM3936987 Cerebellum SRP165177 none
26 SRR9667722 PRJNA495714 SAMN12249847 ChIP-Seq 75 2.64 G 1.16 Gb SRX6428311 GSM3936988 SINGLE ChIP GENOMIC GSM3936988 Medulloblastoma tissue SRP165177 none
27 SRR9667723 PRJNA495713 SAMN12249846 RNA-Seq 83 9.24 G 3.78 Gb SRX6428314 GSM3936989 PAIRED cDNA TRANSCRIPTOMIC GSM3936989 medullablastoma tumor cells SRP165179 Postnatal day 12 hGFAPcre;Ptchc/c
28 SRR9667724 PRJNA495712 SAMN12249845 RNA-Seq 75 1.74 G 830.27 Mb SRX6428315 GSM3936990 SINGLE cDNA TRANSCRIPTOMIC GSM3936990 GFP+ cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
29 SRR9667725 PRJNA495712 SAMN12249844 RNA-Seq 75 1.72 G 826.82 Mb SRX6428316 GSM3936991 SINGLE cDNA TRANSCRIPTOMIC GSM3936991 GFP+ cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
30 SRR9667726 PRJNA495712 SAMN12249843 RNA-Seq 75 1.73 G 832.94 Mb SRX6428317 GSM3936992 SINGLE cDNA TRANSCRIPTOMIC GSM3936992 GFP+ cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
31 SRR9667727 PRJNA495712 SAMN12249841 RNA-Seq 75 1.63 G 787.67 Mb SRX6428318 GSM3936993 SINGLE cDNA TRANSCRIPTOMIC GSM3936993 GFP- cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
32 SRR9667728 PRJNA495712 SAMN12249840 RNA-Seq 75 1.63 G 782.61 Mb SRX6428319 GSM3936994 SINGLE cDNA TRANSCRIPTOMIC GSM3936994 GFP- cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
33 SRR9667729 PRJNA495712 SAMN12249838 RNA-Seq 75 1.79 G 861.33 Mb SRX6428320 GSM3936995 SINGLE cDNA TRANSCRIPTOMIC GSM3936995 GFP- cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
1.1 原始文件下载
mkdir SRRfile
prefetch --option-file SRR_Acc_List.txt
1.2 使用fasterq进行解压缩
ls .| while read id;do(fasterq-dump $id);done
mkdir ../fastq
mv *.fastq ../fastq
忘记设置output的地址了,这下得把所有的fastq文件重新整理一下
cd fastq
cat ../SRR_Acc_List.txt | while read id;do(mkdir $id);done
1.3 测序数据的质控
cat SRR_Acc_List.txt | while read id;do(
fastqc -o ./fastqc -t 20 ./fastq/$id/*.fastq \
&& multiqc ./fastqc -o ./fastqc);
done
1.4 测序数据分类整理
由于原始数据涉及了不同的文件类型,所以需要相应进行整理。在总文件夹中写入不同数据类型所对应的SRR序号。shell脚本不是特别熟练,因此就在单机上,使用excel进行筛选了。
# vi chip.txt
SRR7984673
SRR7984674
SRR7984675
SRR9667717
SRR9667718
SRR9667719
SRR9667720
SRR9667721
SRR9667722
# vi rna.txt
SRR7984676
SRR7984677
SRR7984678
SRR7984679
SRR7984680
SRR7984681
SRR7984682
SRR7984683
SRR7984684
SRR7984685
SRR7984686
SRR8037270
SRR8037271
SRR8037272
SRR8037273
SRR9667723
SRR9667724
SRR9667725
SRR9667726
SRR9667727
SRR9667728
SRR9667729
# vi atac.txt
SRR7984671
SRR7984672
RNA-seq共有22个测序文件;ATAC-seq共有2个测序文件;ChIP-seq共有9个测序文件。
2.ChIP-seq数据处理
2.1 使用trim_galore自动去接头
这个数据貌似已经经过质控了,因此这步貌似可做可不做。G没有去接头时,比对上的比例还是能够达到95%以上,还是不错的。
mkdir clean
cat chip.txt | while read id;do\
trim_galore --quality 20 --phred33 \
--length 20 -j 20 \
-o ./clean/$id.fastq.gz \
./fastq/$id/$id.fastq;
done
2.2 使用bowtie2进行比对
mkdir aligned
cat chip.txt | while read id;do \
bowtie2 -p 20 -3 5 --local -x ~/mm10/mm10 -U ./fastq/$id/$id.fastq | \
samtools sort -O bam -o ./aligned/$id.bam; \
done
比对过程需要花很多时间。
# SRR7984673
# 14675638 reads; of these:
# 14675638 (100.00%) were unpaired; of these:
# 339654 (2.31%) aligned 0 times
# 8731707 (59.50%) aligned exactly 1 time
# 5604277 (38.19%) aligned >1 times
# 97.69% overall alignment rate
# [bam_sort_core] merging from 4 files and 1 in-memory blocks...
......
2.3 使用sambamba去重
比对的效率总体还是不错的。比对之后进行PCR去重。一开始选择了picard工具进行去重。
mkdir deduplicate
cat atac.txt |while read id
do(picard MarkDuplicates --REMOVE_DUPLICATES true -I aligned/${id}.bam \
-O ./deduplicate/${id}_deduplicate.bam -M ${id}.log);
done
然而占用的线程过多,为了避免被管理员拉黑,选择其他的去重方法sambamba
mkdir deduplicate
cat chip.txt |while read id;do
sambamba markdup -r aligned/${id}.bam deduplicate/${id}_deduplicate.bam;
done
2.4 使用MACS2 call peak
在call peak之前,需要对明确对照组和实验组,根据文章的介绍,对GSM序号和SRR序号进行配对。具体的对应关系如下所示:
SRR7984673 GSM3423044 ChIP_CB_Olig2
SRR7984674 GSM3423045 ChIP_MB_Olig2
SRR7984675 GSM3423046 ChIP_MB_H3k27ac
SRR9667717 GSM3936983 ChIP_CB_H3k27ac
SRR9667718 GSM3936984 ChIP_MB_Olig2_repeat 2
SRR9667719 GSM3936985 ChIP_MB_Olig2_repeat 3
SRR9667720 GSM3936986 ChIP_MB_H3k27ac_repeat 2
SRR9667721 GSM3936987 CB_Input
SRR9667722 GSM3936988 MB_Input
因为涉及比较的组数不多,所以可以手动进行设置,如需批量操作,可以自行构建比较文档。
MACS2的命令比较简单,主要就是确定好对照组和实验组。
mkdir macs2
# MB_H3K27ac_1
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR7984675_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_H3K27ac_1_
# MB_H3K27ac_2
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR9667720_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_H3K27ac_2_
# MB_OLIG2_1
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR7984674_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_OLIG2_1_
# MB_OLIG2_2
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR9667718_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_OLIG_2_
# MB_OLIG2_3
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR9667719_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_OLIG_3_
# CB_OLIG2_1
macs2 callpeak -c ./deduplicate/SRR9667721_deduplicate.bam \
-t ./deduplicate/SRR7984673_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n CB_OLIG_1_
# CB_H3K27ac_1
macs2 callpeak -c ./deduplicate/SRR9667721_deduplicate.bam \
-t ./deduplicate/SRR9667717_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n CB_H3K27ac_1_
每次比较都会得到四个文件:NAMEpeaks.xls,NAMEsummits.bed,NAME_model.r,NAMEpeaks.narrowPeak
NAMEpeaks.xls:peak信息,和bed格式类似,但是以1为基,而bed文件是以0为基.也就是说xls的坐标都要减一才是bed文件的坐标
NAMEpeaks.narrowPeak:后面4列表示为, integer score for display, fold-change,-log10pvalue,-log10qvalue,relative summit position to peak start。内容和NAMEpeaks.xls基本一致,适合用于导入R进行分析。
NAMEsummits.bed:记录每个peak的peak summits,也就是记录极值点的位置。MACS建议用该文件寻找结合位点的motif。
NAME_model.r:能通过NAME_model.r作图,得到是基于你提供数据的peak模型
2.5 deeptools可视化
peak的可视化有多种软件可供选择,包括deeptools。也可以将数据导入R环境中,用Y叔的CHIPseeker包进行可视化。
这是原文中Figure6F-H的数据。F图是通过热图展示OLIG2 binding的位点上,也往往有高强度的H3K27ac的信号,提示着一种OLIG2在增强相关基因表达的机制。G图是显示在MB肿瘤中,OLIG2和H3K27ac的结合情况。H图则是针对结合的位点进行motif的分析,预测最有可能对目标基因进行调控的转录因子。
原文中对方法学的描述是这样的:
All sequencing data were mapped to mouse genome assembly mm10, and ChIP-seq peak calling was performed as previously described using model-based analysis of ChIP-seq (MACS, version 1.4.2; http://liulab.dfci.harvard.edu/MACS) with default parameters to get primary binding regions. To ensure that our data were of high quality and reproducibility, we called peaks with enrichment R10-fold over control (p % 10-9) and compared the peak sets using the ENCODE overlap rules. The identified primary regions were further filtered using the following criteria, to define a more stringent protein-DNA interactome: (1) the p value cutoff was set to %10-9; and (2) we required an enrichment of six-fold and peak height >5. The genome-wide distribution of protein-binding regions was determined by HOMER (http://homer.salk.edu/homer/index.html) in reference to UCSC mm10. For all ChIP-seq data sets, WIG files were generated with MACS, which were subsequently visualized using Mochiview v1.46. Occupancy was analyzed with Pearson’s correlation and ToppCluster (https://toppcluster.cchmc.org/). ChIP-seq heat maps were ordered by strength of binding. The heat maps were drawn using the Heatmap tools provided by Cistrome (http://cistrome.org/ap).
以下开始对这些图进行复现。
#首先将bam转换为bw文件
effective_genome_size=1870000000
cat chip.txt |while read id;do
bamCoverage -p 25 --bam ./deduplicate/${id}_deduplicate.bam -o ./bw/${id}.bw \
--binSize 10 --normalizeUsing RPGC --effectiveGenomeSize ${effective_genome_size}
done
F图
# conda install deeptools
mkdir deeptools
computeMatrix reference-point --referencePoint TSS -p 20\
-b 5000 -a 5000 -R ~/mm10.tss.bed \
-S ./bw/SRR7984673.bw ./bw/SRR7984674.bw ./bw/SRR7984675.bw \
--skipZeros -o ./deeptools/F.gz \
--outFileSortedRegions ./deeptools/F_Heatmap1sortedRegions.bed
cd deeptools
plotHeatmap -m F.gz -out F.TSS.Heatmap.pdf --plotFileFormat pdf
plotProfile -m F.gz -out F.TSS.Profile.pdf --plotFileFormat pdf --perGroup
cd ..
G图
computeMatrix reference-point --referencePoint TSS -p 20\
-b 5000 -a 5000 -R ~/mm10.tss.bed \
-S ./bw/SRR7984674.bw ./bw/SRR7984675.bw \
--skipZeros -o ./deeptools/G.gz \
--outFileSortedRegions ./deeptools/G_Heatmap1sortedRegions.bed
cd deeptools
plotHeatmap -m G.gz -out G.TSS.Heatmap.pdf --plotFileFormat pdf
plotProfile -m G.gz -out G.TSS.Profile.pdf --plotFileFormat pdf --perGroup
仍然与原文中存在一些差异。这里TSS的范围没有按照原文-±400b的范围进行可视化,另外,需要对识别到的peak进行相应的筛选。
2.6 homer寻找motif
阅读原文的方法学部分,作者通过homer进行motif的分析。下面对相关motif进行复现
本文是针对MB中OLIG2结合位点进行motif分析,因此,我们通过macs2获得的peak,对特征性的motif进行识别。
awk '{print $4"t"$1"t"$2"t"$3"t "}' ./macs2/MB_OLIG2_1__peaks.narrowPeak > ./macs2/MB_OLIG2_1.homer_peaks.tmp
# 寻找motif
mkdir homer
findMotifsGenome.pl ./macs2/MB_OLIG2_1__summits.bed ~/mm10.fa ./homer/MB_OLIG2_1.homer.motif -size 200 -len 8,10,12
针对结合的DNA序列,会返回很多识别到的motif序列。但貌似跟原文展示的最显著的都不太一样。
2.7 DiffBind找差异peak
由于本文中涉及多组进行比较,因此还可以通过DiffBind的方法寻找差异peak。
3.ATAC-seq数据处理
3.1 使用bowtie2进行序列比对
ATAC与ChIP类似,还是需要先将测序的片段比对到基因组上去
cat atac.txt | while read id;do \
bowtie2 -p 25 -3 5 --local -x ~/mm10/mm10 -U ./fastq/$id/$id.fastq | \
samtools sort -O bam -o ./aligned/$id.bam; \
done
#14541510 reads; of these:
# 14541510 (100.00%) were unpaired; of these:
# 171629 (1.18%) aligned 0 times
# 9336347 (64.20%) aligned exactly 1 time
# 5033534 (34.61%) aligned >1 times
#98.82% overall alignment rate
#[bam_sort_core] merging from 4 files and 1 in-memory blocks...
#7597315 reads; of these:
# 7597315 (100.00%) were unpaired; of these:
# 95358 (1.26%) aligned 0 times
# 4675659 (61.54%) aligned exactly 1 time
# 2826298 (37.20%) aligned >1 times
#98.74% overall alignment rate
[bam_sort_core] merging from 2 files and 1 in-memory blocks...
注意,这里得到的bam文件是已经排过序的bam文件
3.2 去除PCR重复
同ChIP。
# mkdir deduplicate
cat atac.txt |while read id;do
sambamba markdup -r aligned/${id}.bam deduplicate/${id}_deduplicate.bam;
done
3.3 去重之后使用macs2进行call peak
同ChIP。
macs2 callpeak -c ./deduplicate/SRR7984671_deduplicate.bam \
-t ./deduplicate/SRR7984672_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
--nomodel --shift -100 --extsize 200 -n ATAC_
这步之后也可以获得narrowPeak文件和bed文件。
3.4 获取bw文件
前面的bam文件,narrowPeak文件,bed文件都有了,现在还差一个bw文件
bw文件可以获得测序的覆盖度
# mkdir bw
effective_genome_size=1870000000
cat atac.txt |while read id;do
bamCoverage -p 20 --bam ./deduplicate/${id}_deduplicate.bam -o ./bw/${id}.bw \
--binSize 10 --normalizeUsing RPGC --effectiveGenomeSize ${effective_genome_size}
done
这是原文中根据ATAC-seq展现的不同基因位点染色质开放程度的差异。
导出bw文件,在igv中挑选了几个基因进行了可视化,的确可以看到OLIG2+组中,染色质的开放程度是更高的
3.5 使用deeptools可视化
# mkdir deeptools
computeMatrix reference-point --referencePoint TSS -p 20\
-b 5000 -a 5000 -R ~/mm10.tss.bed \
-S ./bw/SRR7984671.bw ./bw/SRR7984672.bw \
--skipZeros -o ./deeptools/ATAC_TSS.gz \
--outFileSortedRegions ./deeptools/ATAC_Heatmap1sortedRegions.bed
cd deeptools
plotHeatmap -m ATAC_TSS.gz -out TSS.Heatmap.pdf --plotFileFormat pdf
plotProfile -m ATAC_TSS.gz -out TSS.Profile.pdf --plotFileFormat pdf --perGroup
不同的参数所映射的含义也不相同,可以通过这张图进行了解(来源:plob.org)。
从本数据中得到的Profile图中,也可以明显看到Olig2+的样本中染色体的可及性更高,与上述结论一致。
4.RNA-seq数据处理
RNA-seq数据的处理和分析已经比较普及了,在此先略过。
5.写在后面
虽然能把整个pepline顺过来,但是对于部分参数的理解还是不够深入,仍然需要对相关生物学背景和软件有更深的认识。感谢生信技能树搭建这个生信爱好者的平台,以及Jimmy大神无私分享的这么多资料,让生信这条学习路线不再陡峭。与大家一起进步。