Cut&tag数据分析流程
1.数据为clean版本,对数据为进行质控
reads 开始时序列内容不一致是 CUT&Tag reads 中常见的现象。 没有通过每个基本的 sequnence 内容不意味着你的数据失败。
可能由于是 Tn5 的偏好性
您可能检测到的是 10 bp 的周期性,该周期性在长度分布中显示为锯齿状。如果是这样,这是正常现象,不会影响校准或 Peak calling。无论如何,我们都不建议您修剪 reads,因为我们列出的 bowtie2 参数将提供准确的 mapping 信息而无需修剪。
2.用 bowtie2进行比对
参考基因组索引建立 采用华中农大SWO.v3.0.genome.fa 华中农大参考基因组
比对参数 bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 比对部分参考文章bowtie2比对
3.比对的sam文件转换成bam文件 samtools工具
命令:$ samtools view -bS abc.sam > abc.bam
sam文件如图
bam文件为二进制文件但运行速度快,方便后续homer进行peakcalling.
4.homer软件进行peakcalling 步骤:
4.1 比对基因组得到bam文件之后,首先用通过makeTagDirectory这个命令,生成一个文件夹,用法如下 详细参考
makeTagDirectory out_dir align.bam
findPeaks out_dir / -style factor -o homer.peak.bed (https://www.jianshu.com/p/2e0c884d13d0)
#完成peak calling 数据保存在bed文件内 和txt文件内
5.custom motif 分析
通过homer findMotifsGenome.pl 脚本进行基因组 motif预测 代码为 findMotifsGenome.pl peaks.txt ALF.fasta OutputResults/
homer的findpeaks命令官方详解homer findpeaks
homer 的 de nove motif分析 报告参数解释 motif结果分析
homer软件自带了一些人类 ,小鼠,拟南芥等模式动植物的参考基因组,如果homer没有内置你需要的物种基因组 可以通过loadgenome.pl命令内置.
#将甜橙基因组加入到homer参考数据库 官方文档 loadgenome.pl自定义参考基因组和注释
注释文件格式转换
软件cufflink先将华农甜橙的数据SWO.v3.0.gene.model.gff3 转换成SWO.v3.0.gene.model.gtf格式 方便homer载入注释
gffread SWO.v3.0.gene.model.gff3 -T -o SWO_V3_model.gtf
载入homer参考基因组的命令:
loadGenome.pl -name SWO_V3_right -org null -fasta SWO.v3.0.genome.fa -gtf ./SWO_v3_gene_model_gff3/SWO_V3_model.gtf
载入homer的甜橙基因组为 SWO_V3_right
载入homer甜橙参考基因组之后 motif预测分析命令为:
findMotifsGenome.pl 92_homer_peak.bed SWO_V3_right 92_de_nove_motif/ -size 200 -len 6
In general, when analyzing ChIP-Seq / ChIP-Chip peaks you should expect to see strong enrichment for a motif resembling the site recognized by the DNA binding domain of the factor you are studying. Enrichment p-values reported by HOMER should be very very significant (i.e. << 1e-50). If this is not the case, there is a strong possibility that the experiment may have failed in one way or another. For example, the peaks could be of low quality because the factor is not expressed very high.
homer需要另外关注的一些东西Finding instances of motifs near peaks(http://homer.ucsd.edu/homer/ngs/quantification.html)
找到预测的motif之后将motif回帖到参考基因组有三种方式
方法1:The scanMotifGenomeWide.pl script will take a motif file (may contain multiple motifs) and look for instances across the genome. The basic command looks like this (output file is sent to stdout):
scanMotifGenomeWide.pl <motif file> <genome> [options]
例:scanMotifGenomeWide.pl /home/xz/biodata_cut_tag/biodata2021-12-23/alignment/92_team/92_de_nove_motif/knownResults/known26.motif SWO_V3_right -bed > /home/xz/biodata_cut_tag/biodata2021-12-23/alignment/92_team/92_motif_annote/known_motif26_ASL18_LOBAS2/92_motif26_to_swo3_.bed
scanMotifGenomeWide.pl pu1.motif mm9 -bed > pu1.sites.mm9.bed 参考文档
方法2:1. Run findMotifsGenome.pl with the "-find <motif file>" option. This will output a tab-delimited text file with each line containing an instance of the motif in the target peaks. The output is sent to stdout.
For example: findMotifsGenome.pl ERalpha.peaks hg18 MotifOutputDirectory/ -find motif1.motif > outputfile.txt
The output file will contain the columns:
Peak/Region ID
Offset from the center of the region
Sequence of the site
Name of the Motif
Strand
Motif Score (log odds score of the motif matrix, higher scores are better matches)
findMotifsGenome.pl ./92_homer_peak.bed SWO_V3_right outdir/ -find ./known39.motif > motif39_2.txt
findMotifsGenome.pl 92_homer_peak.bed SWO_V3_right -find /home/xz/biodata_cut_tag/biodata2021-12-23/alignment/92_team/92_de_nove_motif/knownResults/known26.motif > home/xz/biodata_cut_tag/biodata2021-12-23/alignment/92_team/92_motif_annote/known_motif26_ASL18_LOBAS2/motif_26_2.txt
方法3:Run annotatePeaks.pl with the "-m <motif file>" option (see the annotation section for more info). Chuck prefers doing it this way. This will output a tab-delimited text file with each line containing a peak/region and a column containing instance of each motif separated by commas to stdout
annotatePeaks.pl ERalpha.peaks hg18 -m motif1.motif > outputfile.txt
other:脚本文件 根据txt文件得染色体位置 找序列
#!/bin/bash #通过samtools 可以用染色体的位置信息找序列 不过也需要先建立samtools索引
#根据92_peak_site_seq.txt 染色体位置找到对应序列
cat 92_peak_site_seq.txt | while read i
do
chr_num=$(echo ${i}|cut -d ' ' -f 1)
start_num=$(echo ${i}|cut -d ' ' -f 2)
end_num=$(echo ${i}|cut -d ' ' -f 3)
echo "${chr_num}:${start_num}-${end_num}"
samtools faidx /home/xz/biodata_cut_tag/biodata2021-12-23/SWO_GENOME_REF/SWO.v3.0.genome.fa ${chr_num}:${start_num}-${end_num} >>seq.txt #找到序列
done