数据准备
数据来源
测序下机数据
直接进行质控和后续分析;
已发表数据
以“The PP2A-interactor TIP41 modulates ABA responses in Arabidopsis thaliana”,这篇文章为例;
获取bioproject号;
NCBI数据库搜索
获取SRR号
SRR号如下:
SRR6281250 【tip41ABA2】tip41_1 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep2
SRR6281251 【tip41ABA3】tip41_1 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep3
SRR6281252 【tip41Con1】tip41_1 12 d old in vitro on germination media_rep1
SRR6281253 【tip41Con2】tip41_1 12 d old in vitro on germination media_rep2
SRR6281254 【ColABA2】Col-0 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep2
SRR6281255 【ColABA3】Col-0 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep3
SRR6281256 【ColCon3】Col-0 12 d old in vitro on germination media_rep3
SRR6281257 【ColABA1】Col-0 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep1
SRR6281258 【ColCon1】Col-0 12 d old in vitro on germination media_rep1
SRR6281259 【ColCon2】Col-0 12 d old in vitro on germination media_rep2
SRR6281260 【tip41Con3】tip41_1_12 d old in vitro on germination media_rep3
SRR6281261 【tip41ABA1】tip41_1 12 d old in vitro on germination media and treated for 3h with 50 μM ABA_rep1
数据下载 【sratoolkit prefetch】
vim prefetch.sh
写一个for循环
for ((i = 50; i <= 61; i++))
do prefetch SRR62812$i 2> prefetch.log
done
运行程序
nohup bash prefetch.sh &
数据拆分 【sratoolkit fastq-dump】
SRR数据下载完成后需要进行拆分成fastq格式才能用于后续分析,数据拆分使用NCBI sratoolkit工具包中的fastq-dump程序;
nohup fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files *.sra 2> fastq-dump.log &
质控 【fastp】
使用fastp对原始数据进行质控;
vim fastp.sh
for file in `ls ../01_data/*_1.fastq | perl -lpe 's/_1.fastq//'`;
do fastp -i ${file}_1.fastq -I ${file}_2.fastq -o ${file}_R1.clean.fq.gz -O ${file}_R2.clean.fq.gz -j ${file}.json -h ${file}.html 2> ${file}.log;
done
nohup bash fastp.sh 2> fastp.log &
参考基因组准备
对于有参转录组分析,需要准备参考基因组文件(fasta file) 和基因组注释文件(gff3/gtf file),在比对前需将gff3格式转换成gtf格式;
gff3转换gtf
gffread TAIR10_GFF3_genes.gff -T -o TAIR10_gtf_genes.gtf
比对 【hisat2】
这一步需要将质控后的数据比对到参考基因组上,进行比对的软件有很多,在此使用hisat2,生成sam文件;比对之前需要对参考基因组建立索引;
vim hisat2.sh
genome=../03_genome/TAIR10_chr_all.fa
genome_prefix=../03_genome/TAIR10_chr_all
hisat2-build $genome $genome_prefix 1>hisat2-build.log 2>hisat2-build.err
for sample in `ls ../02_fastp/*_R1.clean.fq.gz | perl -lpe 's/_R1.clean.fq.gz//'`;
do hisat2 --new-summary -p 2 -x $genome_prefix -1 ${sample}_R1.clean.fq.gz -2 ${sample}_R2.clean.fq.gz ${sample}.sam 2> ${sample}.err;
done
nohup bash hisat2.sh 2> hisat2.log &
排序去重 【samtools sort】
将生成的sam文件转换成bam文件,并排序去重,用于后续counts数计算;
for file in *.sam;do
samtools view -b $file > ${file%.sam}.bam
samtools sort ${file%.sam}.bam > ${file%.sam}.sorted.bam
samtools index ${file%.sam}.sorted.bam
done
counts数计算 【Rsubread】
使用Rsubread包计算counts数;
vim featurecounts.sh
for file in ../05_sorted/*.sorted.bam; do
Rscript run-featurecounts.R ../05_sorted/$file ../03_genome/TAIR10_gtf_genes.gtf 10 $file
done
vim run-featurecounts.R
library(Rsubread)
library(limma)
library(edgeR)
args <- commandArgs( trailingOnly = TRUE )
bamFile <- args[1]
gtfFile <- args[2]
nthreads <- args[3]
outFilePref <- args[4]
outStatsFilePath <- paste(outFilePref, '.stat', sep = '');
outCountsFilePath <- paste(outFilePref, '.count', sep = '');
fCountsList = featureCounts(bamFile, annot.ext=gtfFile, isGTFAnnotationFile=TRUE, nthreads=nthreads, isPairedEnd=TRUE)
dgeList = DGEList(counts=fCountsList$counts, genes=fCountsList$annotation)
fpkm = rpkm(dgeList, dgeList$genes$Length)
tpm = exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
write.table(fCountsList$stat, outStatsFilePath, sep="\t", col.names=FALSE, row.names=FALSE, quote=FALSE)
featureCounts = cbind(fCountsList$annotation[,1], fCountsList$counts, fpkm, tpm)
colnames(featureCounts) = c('gene_id', 'counts', 'fpkm','tpm')
write.table(featureCounts, outCountsFilePath, sep="\t", col.names=TRUE, row.names=FALSE, quote=FALSE)
nohup bash featurecounts.sh 2> featurecounts.log &
生成表达矩阵 【reshape2】
将所有的counts数合并成矩阵,用于后续分析;
vim matrix.sh
perl -lne 'if ($ARGV=~/(.*).count/){print "$1\t$_"}' *.count > matrix.count
awk -F"\t" '{print $1"\t"$2"\t"$3}' matrix.count > gene.count
awk -F"\t" '{print $1"\t"$2"\t"$5}' matrix.count > gene.tpm
vim matrix.R
a=read.csv('gene.count',header=F,sep="\t")
colnames(a)=c('sample','gene','count')
library(reshape2)
counts=dcast(a,formula=gene~sample)
write.table(counts,file="gene_count.txt",sep="\t",quote=FALSE,row.names=FALSE)
Rscript matrix.R
查看样本相关系数
library(pheatmap)
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/")
tpmTable <- read.table("gene_tpm_sort.txt", check.names = F, sep = "\t", row.names = 1,
header = T)
ddCor <- cor(tpmTable, method = "spearman")
pheatmap(file = "allSampleWithNum.cor.pdf", ddCor, clustering_method = "average",
display_numbers = T)
差异表达分析
准备counts文件
head -n 1 gene_count.txt
gene tip41ABA2 tip41ABA3 tip41Con1 tip41Con2 ColABA2 ColABA3 ColCon3 ColABA1 ColCon1 ColCon2 tip41Con3 tip41ABA1
awk -F"\t" '{print $1"\t"$10"\t"$11"\t"$8"\t"$9"\t"$6"\t"$7"\t"$4"\t"$5"\t"$12"\t"$13"\t"$2"\t"$3}' gene_count.txt > gene_count_sort.txt
head -n 1 gene_count.txt > 1.txt
cat gene_tpm.txt | head -n 33324 | tail -n +2 >> 1.txt
awk -F"\t" '{print $1"\t"$10"\t"$11"\t"$8"\t"$9"\t"$6"\t"$7"\t"$4"\t"$5"\t"$12"\t"$13"\t"$2"\t"$3}' 1.txt > gene_tpm_sort.txt
awk -F"\t" '{print $1"\t"$5"\t"$6"\t"$7"\t"$11"\t"$12"\t"$13}' gene_count_sort.txt > gene_ColABA_vs_tip41ABA_count.txt
awk -F"\t" '{print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7}' gene_count_sort.txt > gene_ColCon_vs_ColABA_count.txt
awk -F"\t" '{print $1"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13}' gene_count_sort.txt > gene_tip41Con_vs_tip41ABA_count.txt
awk -F"\t" '{print $1"\t"$2"\t"$3"\t"$4"\t"$8"\t"$9"\t"$10}' gene_count_sort.txt > gene_ColCon_vs_tip41Con_count.txt
#cut -f1,2,3,4,8,9,10 gene_count_sort.txt > gene_ColCon_vs_tip41Con_count.txt
差异表达分析
【ColCon_vs_ColABA】
#载入DESeq2包
library(DESeq2)
#【ColCon_vs_ColABA】
#设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/ColCon_vs_ColABA/")
#读取数据
raw <- read.table("gene_ColCon_vs_ColABA_count.txt", header = T, row.names = 1)
#矩阵化
Raw <- as.matrix(raw)
#Use DESeq2 to identify DE genes
raw_condition <- factor(c(rep("ColCon",3),rep("ColABA",3)))
coldata <- data.frame(row.names=colnames(raw), raw_condition)
dds <- DESeqDataSetFromMatrix(countData=raw, colData=coldata, design=~raw_condition)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
summary(resOrdered)
# out of 24871 with nonzero total read count
# adjusted p-value < 0.1
# LFC > 0 (up) : 4304, 17%
# LFC < 0 (down) : 4916, 20%
# outliers [1] : 12, 0.048%
# low counts [2] : 3315, 13%
# (mean count < 2)
# [1] see 'cooksCutoff' argument of ?results
# [2] see 'independentFiltering' argument of ?results
#查看有显著差异的基因的个数
print(sum(resOrdered$padj < 0.01 & abs(resOrdered$log2FoldChange)>1, na.rm=TRUE))
# [1] 2406
#output DE results
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)),
by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
#过滤结果
resdata_filter <- subset(resdata,resdata$padj<0.01 & abs(resdata$log2FoldChange) >1)
#保存结果
write.table(resdata_filter, file="diffexpr_ColCon_vs_ColABA_0.01.txt",sep="\t",
row.names=F)
PCA分析
输入的矩阵不同,得出的结论也会不同。由于基因的表达水平在不同样本间本身就存在一定的差异,所以无论是采用原始的还是归一化之后的表达量矩阵,效果都不理想。针对这一问题,DESeq2提出了两种count值的转换算法,rlog和VST转换。
dds <- estimateSizeFactors(dds)
raw <- SummarizedExperiment(counts(dds, normalized=FALSE), colData=colData(dds))
nor <-SummarizedExperiment(counts(dds, normalized=TRUE), colData=colData(dds))
#vst转换
vsd <- vst(dds)
#rlog转换
rld <- rlog(dds)
#两种转换本质上是在降低生物学重复之间的差异,使得样本聚类和PCA分析的效果更好。转换之后的表达量数据可以采用assay函数进行提取;VST和rlog转换之后,生物学重复之间更佳的接近,不同分组也区分的较为明显
head(assay(rld[, 1:6]))
# ColCon1 ColCon2 ColCon3 ColABA1 ColABA2 ColABA3
# AT1G01010 8.308275 8.318305 7.905503 7.940777 7.976028 7.990062
# AT1G01020 7.349027 7.310091 7.511672 7.180577 7.253919 7.214220
# AT1G01030 6.437458 6.479145 6.413880 6.140799 6.081422 6.110596
# AT1G01040 10.296109 10.303349 10.415850 10.340620 10.284638 10.326584
# AT1G01046 3.061996 3.039576 3.187465 2.977824 2.929619 2.972257
# AT1G01050 9.849337 9.843198 9.789624 9.783000 9.791312 9.907184
#对于raw count data,采用rlog或者VST转换之后的数据去进行PCA和聚类分析,效果会更好。
plotPCA(vsd, intgroup=c("raw_condition"), returnData=TRUE)
# PC1 PC2 group raw_condition name
# ColCon1 -34.69931 9.36822458 ColCon ColCon ColCon1
# ColCon2 -25.90069 4.80286144 ColCon ColCon ColCon2
# ColCon3 -26.81507 -15.38832746 ColCon ColCon ColCon3
# ColABA1 29.88549 3.38267261 ColABA ColABA ColABA1
# ColABA2 27.76088 -0.09886219 ColABA ColABA ColABA2
# ColABA3 29.76870 -2.06656898 ColABA ColABA ColABA3
plotPCA(vsd, intgroup=c("raw_condition"))
【ColCon_vs_tip41Con】
#设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/ColCon_vs_tip41Con/")
#读取数据
raw <- read.table("gene_ColCon_vs_tip41Con_count.txt", header = T, row.names = 1)
#矩阵化
Raw <- as.matrix(raw)
#Use DESeq2 to identify DE genes
raw_condition <- factor(c(rep("ColCon",3),rep("tip41Con",3)))
coldata <- data.frame(row.names=colnames(raw), raw_condition)
dds <- DESeqDataSetFromMatrix(countData=raw, colData=coldata, design=~raw_condition)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
summary(resOrdered)
#查看有显著差异的基因的个数
print(sum(resOrdered$padj < 0.01 & abs(resOrdered$log2FoldChange)>1, na.rm=TRUE))
# [1] 2406
#output DE results
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)),
by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
#过滤结果
resdata_filter <- subset(resdata,resdata$padj<0.01 & abs(resdata$log2FoldChange) >1)
#保存结果
write.table(resdata_filter, file="diffexpr_ColCon_vs_ColABA_0.01.txt",
sep="\t",row.names=F)
【ColABA_vs_tip41ABA】
#设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/ColABA_vs_tip41ABA/")
#读取数据
raw <- read.table("gene_ColABA_vs_tip41ABA_count.txt", header = T, row.names = 1)
#矩阵化
Raw <- as.matrix(raw)
#Use DESeq2 to identify DE genes
raw_condition <- factor(c(rep("ColABA",3),rep("tip41ABA",3)))
coldata <- data.frame(row.names=colnames(raw), raw_condition)
dds <- DESeqDataSetFromMatrix(countData=raw, colData=coldata, design=~raw_condition)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
summary(resOrdered)
#查看有显著差异的基因的个数
print(sum(resOrdered$padj < 0.01 & abs(resOrdered$log2FoldChange)>1, na.rm=TRUE))
# [1] 2406
#output DE results
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)),
by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
#过滤结果
resdata_filter <- subset(resdata,resdata$padj<0.01 & abs(resdata$log2FoldChange) >1)
#保存结果
write.table(resdata_filter, file="diffexpr_ColABA_vs_tip41ABA_0.01.txt",
sep="\t",row.names=F)
【tip41Con_vs_tip41ABA】
#设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/difference_expression/tip41Con_vs_tip41ABA/")
#读取数据
raw <- read.table("gene_tip41Con_vs_tip41ABA_count.txt", header = T, row.names = 1)
#矩阵化
Raw <- as.matrix(raw)
#Use DESeq2 to identify DE genes
raw_condition <- factor(c(rep("tip41Con",3),rep("tip41ABA",3)))
coldata <- data.frame(row.names=colnames(raw), raw_condition)
dds <- DESeqDataSetFromMatrix(countData=raw, colData=coldata, design=~raw_condition)
dds <- DESeq(dds)
res <- results(dds)
resOrdered <- res[order(res$padj),]
summary(resOrdered)
#查看有显著差异的基因的个数
print(sum(resOrdered$padj < 0.01 & abs(resOrdered$log2FoldChange)>1, na.rm=TRUE))
# [1] 2406
#output DE results
res <- res[order(res$padj), ]
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)),
by="row.names", sort=FALSE)
names(resdata)[1] <- "Gene"
#过滤结果
resdata_filter <- subset(resdata,resdata$padj<0.01 & abs(resdata$log2FoldChange) >1)
#保存结果
write.table(resdata_filter, file="diffexpr_tip41Con_vs_tip41ABA_0.01.txt",
sep="\t",row.names=F)
韦恩图绘制
- gplots包中的venn函数
- venneuler包中的venneuler函数
- venn包中的venn函数
- venndiagram包中的venn.diagram函数
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/venn/")
install.packages("VennDiagram")
library(VennDiagram)
### venn.diagram (图形输出到文件)
# 显示当前工作目录
listA <- read.table("gene_list_ColABA_vs_tip41ABA.txt", header = FALSE)
listB <- read.table("gene_list_ColCon_vs_ColABA.txt", header = FALSE)
listC <- read.table("gene_list_ColCon_vstip41Con_.txt", header = FALSE)
listD <- read.table("gene_list_tip41Con_vs_tip41ABA.txt", header = FALSE)
ColABA_vs_tip41ABA <- listA$V1
ColCon_vs_ColABA <- listB$V1
ColCon_vstip41Con <- listC$V1
tip41Con_vs_tip41ABA <- listD$V1
length(ColABA_vs_tip41ABA); length(ColCon_vs_ColABA);
length(ColCon_vstip41Con); length(tip41Con_vs_tip41ABA)
# 四个样品的韦恩图
venn.diagram(list(ColABA_vs_tip41ABA = ColABA_vs_tip41ABA,
ColCon_vs_ColABA = ColCon_vs_ColABA,
ColCon_vstip41Con = ColCon_vstip41Con,
tip41Con_vs_tip41ABA = tip41Con_vs_tip41ABA),
fill = c("yellow", "red", "cyan", "forestgreen"),
cex = 1.5, filename = "Col_vs_tip41_venn4.png",
output = TRUE, imagetype = "png",
height = 100, width = 100,
resolution = 100,
compression = "lzw")
#保存为pdf
p <- venn.diagram(list(ColABA_vs_tip41ABA = ColABA_vs_tip41ABA,
ColCon_vs_ColABA = ColCon_vs_ColABA,
ColCon_vstip41Con = ColCon_vstip41Con,
tip41Con_vs_tip41ABA = tip41Con_vs_tip41ABA),
fill = c("yellow", "red", "cyan", "forestgreen"),
cex = 1.5, filename = NULL, output = TRUE)
pdf("Col_vs_tip41_venn4.pdf")
grid.draw(p)
# 绘图参数
venn.diagram(
x = list(set1, set2, set3),
category.names = c("Set 1" , "Set 2 " , "Set 3"),
filename = 'venn2.png',
output=TRUE,
# 输出
imagetype="png" , # 类型(tiff png svg)
#height = 1000 , # 高度
#width = 1000 , # 宽度
resolution = 400, # 分辨率
compression = "lzw", # 压缩算法
# 圈
lwd = 5, # 圈线条粗细 1 2 3 4 5
lty = 1, # 线条类型, 1 实线, 2 虚线, blank 无线条
fill = color, # 填充色
col = c("red", 'green', 'blue'), # 线条色
# 数字 number
cex = 2, # 数字大小
fontface = "bold", # 加粗
fontfamily = "sans", # 字体
# 标签 category
cat.cex = 2, # 字体大小
cat.col = c("red", 'green', 'blue'), # 字体色
cat.fontface = "bold", # 加粗
cat.default.pos = "outer", # 位置, outer 内 text 外
cat.pos = c(-27, 27, 135), # 位置,用圆的度数
cat.dist = c(0.055, 0.055, 0.085), # 位置,离圆的距离
cat.fontfamily = "sans", # 字体
rotation = 1 # 1 2 3 旋转确定大打头数据集
)
基因表达量分布热图
R绘制热图方法
- heatmap()
- gplots包heatmap.2()
- pheatmap包pheatmap()
- ComplexHeatmap包
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/heatmap/")
m <- read.table("DEgene_tpm_ColABA_vs_tip41ABA.txt", header = T, row.names = 1)
class(m)
x <- as.matrix(m)
heatmap(x)
# 样品名和基因名互换t()函数转置
heatmap(t(x))
heatmap(x, col = c("green", "red"))
# 随机生成渐变色
colorRampPalette(c("red", "black", "green"))(nrow(x))
yanse <- colorRampPalette(c("green", "black", "red"))(nrow(x))
heatmap(x, col = yanse)
# 设置颜色条
heatmap(x, col = yanse, ColSideColors = colorRampPalette(c("red", "black", "green"))
(ncol(x)))
# 聚类设置 Rowv(按行聚类);Colv(按列聚类)
heatmap(x, col = yanse, Colv = NA, Rowv = NA)
library(pheatmap)
pheatmap(x)
?pheatmap
colnames(x)
annotation_col <- data.frame(CellType = factor(rep(
c("ColCon", "ColABA", "tip41Con", "tip41ABA"), each = 3)))
# 行号替换为样品名(样品和分组信息对应)
rownames(annotation_col) <- colnames(x)
# 对列分组
pheatmap(x, annotation_col = annotation_col)
# display_numbers显示数值
pheatmap(x, annotation_col = annotation_col, display_numbers = T)
# number_format = "0.2f" 设置数值格式(保留两位小数)
pheatmap(x, annotation_col = annotation_col, display_numbers = T, number_format = "0.2f")
pheatmap(x, annotation_col = annotation_col, display_numbers = T, number_format = "0.1f")
火山图
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/Volcano/")
g <- read.table("diffexpr_ColCon_vs_ColABA_all.txt", header = T, row.names = 1)
head(g)
# 去掉缺失行
g <- na.omit(g)
plot(g$log2FoldChange, g$padj)
plot(g$log2FoldChange, -1*log10(g$padj))
plot(g$log2FoldChange, -1*log10(g$padj), xlim = c(-10, 10), ylim = c(0, 100))
g <- transform(g, padj = -1*log10(g$padj))
down <- g[g$log2FoldChange <= -1,]
up <- g[g$log2FoldChange >=1,]
no <- g[g$log2FoldChange > -1 & g$log2FoldChange < 1,]
plot(no$log2FoldChange, no$padj, xlim = c(-10,10),
ylim = c(0,100), col = "blue", pch = 16, cex = 0.8,
main = "Gene Expression", xlab = "log2FoldChange",
ylab = "-log10(Qvalue)")
points(up$log2FoldChange, up$padj, col = "red", pch = 16, cex = 0.8)
points(down$log2FoldChange, down$padj, col = "green", pch = 16, cex = 0.8)
COG
Clusters of Orthologous Groups of proteins
http://www.ncbi.nlm.nih.gov/COG/
http://www.ncbi.nlm.nih.gov/COG/old/xognitor.html
http://www.nlm.nih.gov/COG/old/COGhelp.html
ftp://ftp.ncbi.nih.gov/pub/COG/
COG注释作用:
- 通过已知蛋白对未知序列进行功能注释;
- 通过查看指定的COG编号对应的protein数目,存在及缺失,从而能推导特定的代谢途径是否存在;
- 每个COG编号是一类蛋白,将query序列和比对上的COG编号的proteins进行多序列比对,能确定保守位点,分析其进化关系。
使用eggNOG-mapper在线工具进行COG和GO注释,需要准备差异基因氨基酸序列;
输入数据和邮箱在线提交即可;
运行完成后下载<pre style="color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">out.emapper.annotations </pre>文件;
处理结果文件,提取COG category信息,用于分析和作图;
COG功能注释图
设置工作目录
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/COG/")
m <- read.table("cog.class.annot.txt", header = T, sep = "\t")
head(m)
layout(matrix(c(1,2), nrow = 1), widths = c(20,13))
layout.show(2)
par(mar = c(3,4,4,1) + 0.1)
# 设定顺序
class <- c("J","A","K","L","B","D","V","T","M","Z","W","U","O","C","G","E","F","H","I","P","Q","S")
# 将m中Code列转化为因子
t <- factor(as.character(m$Code), levels = class)
# 用新的因子数据对原始数据进行排序
m <- m[order(t),]
head(m)
barplot(m$Gene.Number, space = F, col = rainbow(22), ylab = "Number of Genes", names.arg = m$Code)
# 添加分组信息
l <- c(0, 5, 13, 21, 22)
id <- c("INFORMATION STORAGE\nAND PROCESSING", "CELLULAR PROCESSES\nAND SIGNALING", "METABOLISM", "POORLY\n CHAPACTERIZED")
id
abline(v = l[c(-1,-5)])
for (i in 2:length(l)) {
text((l[i-1] + l[i])/2, max(m[,3])*1.1, id[i-1], cex = 0.8, xpd = T)
}
par(mar = c(2,0,2,1) + 0.1)
# axes 不显示坐标轴
plot(0,0,type = "n", xlim = c(0,1), ylim = c(0,26), bty = "n", axes = F, xlab = "", ylab = "")
for ( i in 1:length(class)){
text(0,26-i+0.5, paste(m$Code[i], m$Functional.Categories[i]), pos = 4, cex = 1, pty = T)
}
# 添加标题
title(main = "COG Function Classification")
GO enrichment analysis
http://eggnog5.embl.de/#/
https://github.com/eggnogdb/eggnog-mapper/wiki
http://current.geneontology.org/ontology/go.obo
GO号获取
基因组功能注释文件
eggNOG-mapper
http://eggnog-mapper.embl.de/job_status?jobname=MM_z9vko3fw
geneontology.org
http://current.geneontology.org/products/pages/downloads.html
http://current.geneontology.org/annotations/tair.gaf.gz
从ATH_GO_GOSLIM.txt提取差异基因GO号;
awk -F"\t" '{print $1"\t"$5"\t"$6"\t"$8}' ATH_GO_GOSLIM.txt > 1.txt
cat 1.txt | grep -f DEgene_list_GOenrichment.txt > 2.txt
awk -F"\t" '{print $4"\t"$2"\t"$3}' 2.txt > 3.txt
sort -k3,4 3.txt | uniq -c > 4.txt
awk -F" " '{print $1}' 4.txt > 5.txt
awk -F"\t" '{print $1}' 4.txt | awk -F" " '{print $2}' > 6.txt
awk -F"\t" '{print $2"\t"$3}' 4.txt > 7.txt
cat 5.txt | awk '{sum+=$1} END {print "Sum = ", sum}'
awk '{print $1"\t""39933"}' 5.txt > 55.txt
echo "$1 $2" | awk -F"\t" '{printf("%.2f\n", ($1/$2)*100)}' 55.txt > 555.txt
sed -i 's/F/Molecular function/g' 6.txt
sed -i 's/C/Cellular component/g' 6.txt
sed -i 's/P/Biological process/g' 6.txt
paste 6.txt 7.txt 555.txt > 8.txt
sed -i '1i\Ontology\tTerm\tGO_ID\tInput_number\tPercentage' 8.txt
mv 8.txt DEgene_GO_term.txt
R包获取GO号和对应注释信息;
library(GO.db)
goterms <- Term(GOTERM)
GOlist = as.data.frame(goterms)
View(GOlist)
keytypes(GO.db)
#[1] "DEFINITION" "GOID" "ONTOLOGY" "TERM"
columns(GO.db)
#[1] "DEFINITION" "GOID" "ONTOLOGY" "TERM"
keys(GO.db, keytype = "GOID")[1:3]
#[1] "GO:0000001" "GO:0000002" "GO:0000003"
k <- keys(GO.db, keytype = "GOID")[1:3]
select(GO.db, keys = k, columns = c("TERM","ONTOLOGY"),
+ keytype="GOID")
'select()' returned 1:1 mapping between keys and columns
# GOID TERM ONTOLOGY
#1 GO:0000001 mitochondrion inheritance BP
#2 GO:0000002 mitochondrial genome maintenance BP
#3 GO:0000003 reproduction BP
write.table(golist, file = "GOlist.txt", sep = "\t", row.names = F)
GO功能注释条形图
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/GO_enrichment/")
go <- read.table("DEgene_GO_term.txt", header = T, sep = "\t")
head(go)
View(go)
library(ggplot2)
# 排序
go_sort <- go[order(go$Ontology, -go$Percentage),]
m <- go_sort[go_sort$Ontology == "Molecular function",][1:10,]
c <- go_sort[go_sort$Ontology == "Cellular component",][1:10,]
b <- go_sort[go_sort$Ontology == "Biological process",][1:10,]
# 合并向量
slimgo <- rbind(b, c, m)
slimgo
# 将Term列转化为因子
slimgo$Term = factor(slimgo$Term, levels = slimgo$Term)
colnames(slimgo)
PP <- ggplot(data = slimgo, mapping = aes(x = Term, y = Percentage, fill = Ontology))
PP
PP + geom_bar(stat = "identity")
# 翻转坐标轴
PP + geom_bar(stat = "identity") + coord_flip()
PP + geom_bar(stat = "identity") + coord_flip() +
scale_x_discrete(limits = rev(levels(slimgo$Term)))
# 去掉图例
PP + geom_bar(stat = "identity") + coord_flip() +
scale_x_discrete(limits = rev(levels(slimgo$Term))) +
guides(fill = FALSE)
# 切换黑白背景主题
PP + geom_bar(stat = "identity") + coord_flip() +
scale_x_discrete(limits = rev(levels(slimgo$Term))) +
guides(fill = FALSE) + theme_bw()
KEGG enrichment analysis
针对非模式材料,使用KAAS在线工具进行KEGG注释;准备参考基因组的蛋白序列文件(fasta格式),上传数据,填写邮箱,提交即可;
https://www.genome.jp/kaas-bin/kaas_main
编写python脚本,提取需要的信息;
python3 kegg-to-geneID.py -i q00001.keg -o TAIR10_kegg.txt
awk -F"\t" '{print $5"\t""ko:"$3"\t"$4}' TAIR10_kegg.txt > TAIR10_kegg_pathway_annotation.txt
ClusterProfiler对差异基因进行富集分析
setwd("~/2021_DengLab_Bioinformatics_training—RNASeq/KEGG_enrichment/")
### ClusterProfiler进行富集分析
#安装clusterProfiler包
BiocManager::install("clusterProfiler")
library(clusterProfiler)
### 准备pathway注释文件
data <- read.table("TAIR10_KEGG_annotation.txt", header = T, sep = "\t")
go2gene <- data[, c(2, 1)]
go2name <- data[, c(2, 3)]
### 准备差异基因列表文件
genelist <- read.table("DEgene_listKEGG_enrichment.txt", header = T)
### 差异基因ID向量
gene <- as.factor(genelist$Gene_ID)
### sorted vector: number should be sorted in decreasing order
gene <- sort(gene, decreasing = TRUE)
### ORA(Over-Representation Analysis)
kk <- enricher(gene,TERM2GENE = go2gene,TERM2NAME = go2name, pvalueCutoff = 0.05,
pAdjustMethod = "BH",
minGSSize = 5,
qvalueCutoff = 0.5)
write.table(kk, file = "diff-keggenricher.txt", quote =F, sep = "\t", row.names = F)
GSEA富集分析
### GSEA(Gene Set Enrichment Analysis)
### ## assume 1st column is ID
## 2nd column is FC
## head(geneList)
## Entrez.ID Fold.Change
## feature 1: numeric vector
glist <- genelist[,2];head(glist)
## feature 2: named vector
names(glist) <- as.character(genelist[,1]);head(glist)
## feature 3: decreasing order
glist <- sort(glist,decreasing = T); head(glist)
#gsea <- GSEA(glist, TERM2GENE=c5, verbose=FALSE, pvalueCutoff = 0.8)
kk <- GSEA(glist, TERM2GENE = go2gene, TERM2NAME = go2name, nPerm = 1000,
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05,
pAdjustMethod = "BH")
KEGG功能注释图(气泡图)
library(ggplot2)
pathway <- read.csv("diff-keggenricher.csv", header = T, sep = ",")
# 查看列名
colnames(pathway)
# [1] "ID" "Description" "GeneRatio" "BgRatio" "pvalue" "p.adjust"
# [7] "qvalue" "geneID" "Count" "All_unigene" "richFactor"
# 映射数据
PP <- ggplot(data = pathway, mapping = aes(x = richFactor, y = Description))
# 绘制点图
PP + geom_point()
# 将点的大小映射到Count列
PP + geom_point(mapping = aes(size = Count))
# qvalue映射到color
PP + geom_point(mapping = aes(size = Count, color = qvalue))
# 修改颜色
PP + geom_point(mapping = aes(size = Count, color = qvalue)) +
scale_color_gradient(low = "green", high = "red")
# 修改标题
PP + geom_point(mapping = aes(size = Count, color = qvalue)) +
scale_color_gradient(low = "green", high = "red") +
labs(title = "Top20 of Pathway Enrichment", x = "Rich Factor",
y = "Pathway Name", color = "-log10(qvalue)", size = "Gene Number") + theme_bw()
以上。