- 前面RNA-seq分析:从软件安装到富集分析部分已经把转录组全部流程走完了一遍,这次利用RNA-seq(2)-2:下载数据中下载的肝癌数据进行分析,不在赘述细节,所以有看细节的还是请去这里。
- 这部分主要是代码和部分结果图,但进行了部分修正,比如KEGG 可视化部分,用了前clusterprofiler部分的结果,所以这部分不包括Gage包。
- 所以从质量比对开始进行
-----------------------------------------------------------------
3 sra到fastq格式转换并进行质量控制
3.1 数据解压:用samtools中的fastq-dump将sra格式转为fastq格式
- 备注:用时5个小时
for ((i=2;i<=5;i++));do fastq-dump --gzip --split-3 -A SRR31621$i.sra -O .;done
3.2 用fastqc进行质量控制
#将所有的数据进行质控,得到zip的压缩文件和html文件
fastqc -o . *.fastq.gz
3.3 3 质控结果解读(为保持紧凑,只放一张图)
4 下载参考基因组及基因注释
RNA-seq(4):下载参考基因组及基因注释部分已经下载
5 序列比对:Hisat2
5.1 开始比对:用hisat2,得到SAM文件(5个小时)
- 我的fastq文件在
/mnt/f/rna_seq/data
- 我的reference在
/mnt/f/rna_seq/data/reference
- 我的index在
/mnt/f/rna_seq/data/reference/index/hg19
- 比对后得到的bam文件会存放在
/mnt/f/rna_seq/aligned
source ~/miniconda3/bin/activate
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/data$ cd ..
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq$
for ((i=2;i<=5;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR31621${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR31621${i}.sra_2.fastq.gz -S SRR31621${i}.sam ;done
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq$ for ((i=2;i<=5;i++));do hisat2 -t -x /mnt/f/rna_seq/data/reference/index/hg19/genome -1 /mnt/f/rna_seq/data/SRR31621${i}.sra_1.fastq.gz -2 /mnt/f/rna_seq/data/SRR31621${i}.sra_2.fastq.gz -S SRR31621${i}.sam ;done
Time loading forward index: 00:00:05
Time loading reference: 00:00:01
Multiseed full-index search: 02:11:39
101339131 reads; of these:
101339131 (100.00%) were paired; of these:
7840870 (7.74%) aligned concordantly 0 times
87196406 (86.04%) aligned concordantly exactly 1 time
6301855 (6.22%) aligned concordantly >1 times
----
7840870 pairs aligned concordantly 0 times; of these:
333400 (4.25%) aligned discordantly 1 time
----
7507470 pairs aligned 0 times concordantly or discordantly; of these:
15014940 mates make up the pairs; of these:
8711709 (58.02%) aligned 0 times
5441734 (36.24%) aligned exactly 1 time
861497 (5.74%) aligned >1 times
95.70% overall alignment rate
5.2 SAM文件转换为BAM文件,并对bam文件进行sort,最后建立索引:SAMtools
Sam 文件转换为bam格式(用时1.5hrs)
(base) kelly@DESKTOP-MRA1M1F:/mnt/f/rna_seq/aligned$ for ((i=2;i<=5;i++));do samtools view -S /mnt/f/rna_seq/SRR31621${i}.sam -b > SRR31621${i}.bam;done
对bam文件进行排序,默认染色体位置(用时1.5hrs)
for ((i=2;i<=5;i++));do samtools sort SRR31621${i}.bam
-o SRR31621${i}_sorted.bam;done
建立索引(用时40mins)
for ((i=2;i<=5;i++));do samtools index SRR31621${i}_sorted.bam;done
6 reads计数,合并矩阵并进行注释
6.1 bam文件按reads name排序(用时1h)
for ((i=2;i<=5;i++));do samtools sort -n SRR31621${i}.b
am -o SRR31621${i}_nsorted.bam;done
6.2 2 reads计数,得到表达矩阵htseq-count
(用时很久很久,久到不忍写,8hrs)
注释文件如果已经解压,则不再需要重复
cd ../data/matrix
gunzip /mnt/f/rna_seq/data/reference/annotation/hg19/gencode.v19.annotation.gt.gz && rm -rf gencode.v19.annotation.gtf.gz
for ((i=2;i<=5;i++));do htseq-count -r name -f bam /mnt/f/rna_seq/aligned/SRR31621${i}_nsorted.bam /mnt/f/rna_seq/data/reference/annotation/hg19/gencode.v19.an
notation.gtf > SRR31621${i}.count; done
ls -al *.count
-rwxrwxrwx 1 root root 1197426 Aug 7 16:25 SRR316212.count
-rwxrwxrwx 1 root root 1186189 Aug 7 17:16 SRR316213.count
-rwxrwxrwx 1 root root 1200305 Aug 7 22:51 SRR316214.count
-rwxrwxrwx 1 root root 1187596 Aug 7 23:32 SRR316215.count
6.3 3 合并表达矩阵并进行基因名注释(R中进行)
Tumor:SRR316214,SRR316215
Adjacent Normal Liver:SRR316212,SRR316213
setwd("F:/rna_seq/data/matrix")
options(stringsAsFactors = FALSE)
control1<-read.table("SRR316212.count",sep = "\t",col.names = c("gene_id","control1"))
control2<-read.table("SRR316213.count",sep = "\t",col.names = c("gene_id","control2"))
treat1<-read.table("SRR316214.count",sep = "\t",col.names = c("gene_id","treat1"))
treat2<-read.table("SRR316215.count",sep = "\t",col.names = c("gene_id","treat2"))
raw_count <- merge(merge(control1, control2, by="gene_id"), merge(treat1, treat2, by="gene_id"))
head(raw_count)
raw_count_filt <- raw_count[-1:-5,]
ENSEMBL <- gsub("\\.\\d*", "", raw_count_filt$gene_id)
row.names(raw_count_filt) <- ENSEMBL
head(raw_count_filt)
> head(raw_count_filt)
gene_id control1 control2 treat2.x treat2.y
ENSG00000000003 ENSG00000000003.10 4004 781 756 756
ENSG00000000005 ENSG00000000005.5 1 0 0 0
ENSG00000000419 ENSG00000000419.8 776 140 165 165
ENSG00000000457 ENSG00000000457.9 624 144 240 240
ENSG00000000460 ENSG00000000460.12 260 52 105 105
ENSG00000000938 ENSG00000000938.8 161 59 16 16
7 DEseq2筛选差异表达基因并注释
这次我换了Annotation包进行注释
7.1 载入数据(countData和colData)
# 这一步很关键,要明白condition这里是因子,不是样本名称;小鼠数据有对照组和处理组,各两个重复
condition <- factor(c(rep("control",2),rep("treat",2)), levels = c("control","treat"))
condition
colData <- data.frame(row.names=colnames(mycounts), condition)
> colData
condition
control1 control
control2 control
treat1 treat
treat2 treat
7.2 构建dds对象,开始DESeq流程
dds <- DESeqDataSetFromMatrix(countData=mycounts,
colData=colData,
design= ~ condition)
dds = DESeq(dds)
7.3 总体结果查看
接下来,我们要查看treat versus control的总体结果,并根据p-value进行重新排序。利用summary命令统计显示一共多少个genes上调和下调(FDR0.1)
res = results(dds, contrast=c("condition", "control", "treat"))
res = res[order(res$pvalue),]
head(res)
summary(res)
write.csv(res,file="All_results.csv")
table(res$padj<0.01)
> summary(res)
out of 30981 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 3928, 13%
LFC < 0 (down) : 3283, 11%
outliers [1] : 0, 0%
low counts [2] : 10317, 33%
(mean count < 2)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> table(res$padj<0.01)
FALSE TRUE
16192 4472
可见,上调基因和下调的数目都很多,padj<0.01的有4472,即使padj<0.001的也有3089个。
7.4 提取差异表达genes(DEGs)
先看下padj(p值经过多重校验校正后的值)小于0.01,表达倍数取以2为对数后大于1或者小于-1的差异表达基因。代码如下
diff_gene_deseq2 <-subset(res, padj < 0.01 & abs(log2FoldChange) > 1)
dim(diff_gene_deseq2)
head(diff_gene_deseq2)
write.csv(diff_gene_deseq2,file= "HCC_DEG_0.05_2.csv")
> dim(diff_gene_deseq2)
[1] 3780 6
> head(diff_gene_deseq2)
log2 fold change (MLE): condition control vs treat
Wald test p-value: condition control vs treat
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000130649 78059.342 7.190959 0.2110024 34.07998 1.460506e-254 3.017990e-250
ENSG00000268925 5666.055 7.534595 0.2573299 29.27991 1.869559e-188 1.931628e-184
ENSG00000105697 8791.483 6.611356 0.2338215 28.27523 6.970959e-176 4.801597e-172
ENSG00000167244 8988.099 6.767261 0.2582403 26.20528 2.313301e-151 1.195051e-147
ENSG00000162366 4037.960 -7.076854 0.2704122 -26.17062 5.742578e-151 2.373293e-147
ENSG00000169715 3803.597 6.238596 0.2403373 25.95767 1.489813e-148 5.130914e-145
把padj<0.001,|log2FC|>1有2938个
8 探索分析结果:Data visulization
#MA plot
plotMA(res,ylim=c(-2,2))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
points(baseMean, log2FoldChange, col="dodgerblue", cex=6, lwd=2)
text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})
#shirnked
res_order<-res[order(row.names(res)),]
res = res_order
res.shrink <- lfcShrink(dds, contrast = c("condition","treat","control"), res=res)
plotMA(res.shrink, ylim = c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})
#identify
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
#plotcounts
plotCounts(dds, gene=which.min(res$padj), intgroup="condition", returnData=TRUE)
plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=FALSE)
#boxplot
plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=TRUE) %>%
ggplot(aes(condition, count)) + geom_boxplot(aes(fill=condition)) + scale_y_log10() + ggtitle("CYP2E1")
#pointplot
d <- plotCounts(dds, gene="ENSG00000130649", intgroup="condition", returnData=TRUE)
ggplot(d, aes(x=condition, y=count)) +
geom_point(aes(color= condition),size= 4, position=position_jitter(w=0.5,h=0)) +
scale_y_log10(breaks=c(25,100,400))+ ggtitle("CYP2E1")
##3最小padj
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition",
returnData=TRUE)
ggplot(d, aes(x=condition, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0)) +
scale_y_log10(breaks=c(25,100,400))
#PCA
vsdata <- vst(dds, blind=FALSE)
plotPCA(vsdata, intgroup="condition")
#beatifule pca plot
pcaData <- plotPCA(vsdata, intgroup=c("condition"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition, shape=condition)) +
geom_point(size=3) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed()
library("pheatmap")
select<-order(rowMeans(counts(dds, normalized = TRUE)),
decreasing = TRUE)[1:20]
df <- as.data.frame(colData(dds)[,c("condition","sizeFactor")])
# this gives log2(n + 1)
ntd <- normTransform(dds)
pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df)
#vsd
vsd <- vst(dds, blind=FALSE)
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df)
##sample to sample heatmap
sampleDists <- dist(t(assay(vsd)))
library("RColorBrewer")
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$condition, vsd$type, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)
这部分主要结果部分的一些可视化,具体参考详细的部分即可,只放一张图
9 DEGs的富集分析(功能注释)ClusterProfiler包
##enrichment analysis using clusterprofiler package created by yuguangchuang
library(clusterProfiler)
library(DOSE)
library(stringr)
library(org.Hs.eg.db)
#get the ENTREZID for the next analysis
sig.gene= diff_gene_deseq2
head(sig.gene)
gene<-rownames(sig.gene)
head(gene)
gene.df<-bitr(gene, fromType = "ENSEMBL",
toType = c("SYMBOL","ENTREZID"),
OrgDb = org.Hs.eg.db)
head(gene.df)
> head(gene.df)
ENSEMBL SYMBOL ENTREZID
1 ENSG00000130649 CYP2E1 1571
3 ENSG00000105697 HAMP 57817
4 ENSG00000167244 IGF2 3481
5 ENSG00000162366 PDZK1IP1 10158
6 ENSG00000169715 MT1E 4493
7 ENSG00000074276 CDHR2 54825
GO enrichment
#Go enrichment
ego_cc<-enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
ego_bp<-enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
barplot(ego_bp,showCategory = 18,title="The GO_BP enrichment analysis of all DEGs ")+
scale_size(range=c(2, 12))+
scale_x_discrete(labels=function(ego_bp) str_wrap(ego_bp,width = 25))
KEGG enrichment
library(stringr)
kk<-enrichKEGG(gene =gene.df$ENTREZID,
organism = 'hsa',
pvalueCutoff = 0.05)
> kk[1:30]
ID Description GeneRatio BgRatio
hsa04610 hsa04610 Complement and coagulation cascades 38/1407 79/7431
hsa04933 hsa04933 AGE-RAGE signaling pathway in diabetic complications 40/1407 99/7431
hsa00071 hsa00071 Fatty acid degradation 23/1407 44/7431
hsa04974 hsa04974 Protein digestion and absorption 35/1407 90/7431
hsa05146 hsa05146 Amoebiasis 36/1407 96/7431
hsa04978 hsa04978 Mineral absorption 23/1407 51/7431
hsa04115 hsa04115 p53 signaling pathway 29/1407 72/7431
hsa05144 hsa05144 Malaria 22/1407 49/7431
hsa00982 hsa00982 Drug metabolism - cytochrome P450 28/1407 72/7431
hsa00980 hsa00980 Metabolism of xenobiotics by cytochrome P450 29/1407 76/7431
hsa03320 hsa03320 PPAR signaling pathway 28/1407 74/7431
hsa05204 hsa05204 Chemical carcinogenesis 30/1407 82/7431
mykegg<-barplot(kk,showCategory = 20, title="The KEGG enrichment analysis of all DEGs")+
scale_size(range=c(2, 12))+
scale_x_discrete(labels=function(kk) str_wrap(kk,width = 25))
dotplot(kk,showCategory = 20, title="The KEGG enrichment analysis of all DEGs")+
scale_size(range=c(2, 12))+
scale_x_discrete(labels=function(kk) str_wrap(kk,width = 25))
用cowplot拼起来
library(cowplot)
plot_grid(mygobp,mykegg,labels = c("A","B"),align = "h")
10 KEGG信号通路的可视化(pathview包)
备注,前面详细部分用gageData重新进行了富集分析,这部分直接调用ClusterProfiler包的富集结果
- 选取enrichKEGG结果pvalue排名第一的hsa4610:Complement and coagulation cascades和排名第7的hsa04115:p53 signaling pathway
library("pathview")
foldchanges = sig.gene$log2FoldChange
names(foldchanges)= gene.df$ENTREZID
head(foldchanges)
pathview(gene.data = foldchanges, pathway.id = "hsa04610", species="hsa")
------------------------------------------------------------------------------
hsa4610:Complement and coagulation cascades
------------------------------------------------------------------------------
hsa04115:p53 signaling pathway
11 把counts结果,clusterprofiler部分的symbol name 和DEGs全部合并到一个表格
备注,这部分的主要问题是没用可以merge
用的ID,先看下
> head(mycounts)
control1 control2 treat1 treat2
ENSG00000000003 4004 781 4229 756
ENSG00000000005 1 0 0 0
ENSG00000000419 776 140 1180 165
ENSG00000000457 624 144 1271 240
ENSG00000000460 260 52 610 105
ENSG00000000938 161 59 57 16
> head(gene.df)
ENSEMBL SYMBOL ENTREZID
1 ENSG00000130649 CYP2E1 1571
3 ENSG00000105697 HAMP 57817
4 ENSG00000167244 IGF2 3481
5 ENSG00000162366 PDZK1IP1 10158
6 ENSG00000169715 MT1E 4493
7 ENSG00000074276 CDHR2 54825
> head(sig.gene)
log2 fold change (MLE): condition control vs treat
Wald test p-value: condition control vs treat
DataFrame with 6 rows and 6 columns
ENSEMBL log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000130649 78059.342 7.190959 0.2110024 34.07998 1.460506e-254 3.017990e-250
ENSG00000268925 5666.055 7.534595 0.2573299 29.27991 1.869559e-188 1.931628e-184
ENSG00000105697 8791.483 6.611356 0.2338215 28.27523 6.970959e-176 4.801597e-172
ENSG00000167244 8988.099 6.767261 0.2582403 26.20528 2.313301e-151 1.195051e-147
ENSG00000162366 4037.960 -7.076854 0.2704122 -26.17062 5.742578e-151 2.373293e-147
ENSG00000169715 3803.597 6.238596 0.2403373 25.95767 1.489813e-148 5.130914e-145
所以执行以下代码:
head(sig.gene)
head(gene.df)
ENSEMBL<-rownames(sig.gene)
sig.gene<-cbind(ENSEMBL, sig.gene)
colnames(sig.gene)[1]<-c("ENSEMBL")
head(mycounts)
ENSEMBL<-rownames(mycounts)
mycounts<-cbind(ENSEMBL, mycounts)
colnames(mycounts)[1]<-c("ENSEMBL")
merge1<-merge(sig.gene,mycounts,by="ENSEMBL")
merge2<-merge(merge1,gene.df, by="ENSEMBL")
head(merge2)
write.csv(DEG_symbole,file="hcc_DEGs_last_results.csv")
结果如下:
> head(merge2)
DataFrame with 6 rows and 13 columns
ENSEMBL baseMean log2FoldChange lfcSE stat pvalue padj control1 control2 treat1 treat2
<character> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <integer> <integer> <integer> <integer>
1 ENSG00000000938 70.51956 2.169409 0.4610591 4.705274 2.535248e-06 2.632581e-05 161 59 57 16
2 ENSG00000001460 77.93853 -1.496365 0.4247974 -3.522537 4.274371e-04 2.517116e-03 68 18 268 68
3 ENSG00000001561 382.90350 -1.466940 0.3397462 -4.317751 1.576269e-05 1.347065e-04 416 69 1780 227
4 ENSG00000001626 70.41775 4.713865 0.5812437 8.109965 5.063455e-16 1.944819e-14 226 61 16 1
5 ENSG00000001630 165.86393 -2.283850 0.4094980 -5.577194 2.444286e-08 3.665365e-07 84 28 442 207
6 ENSG00000002549 1683.37557 1.519578 0.2216429 6.855972 7.082930e-12 1.708432e-10 4166 1104 2171 481
SYMBOL ENTREZID
<character> <character>
1 FGR 2268
2 STPG1 90529
3 ENPP4 22875
4 CFTR 1080
5 CYP51A1 1595
6 LAP3 51056
至此,这部分结果可以用来做其他很多下游分析了。