前几天听了一个优秀本科生的学习WGCNA的分享,后浪果然够猛,自然也要好好学习啊!重新跑了一遍流程,复习一下。
参考学习资料:
首先是曾老师的教程:https://github.com/jmzeng1314/my_WGCNA
然后是:https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html
WGCNA基本概念
- 加权基因共表达网络分析(WGCNA)是一种系统生物学方法,用于样品中基因之间的相关模式。WGCNA可用于查找高度相关的基因的簇(模块),使用
module eigengene
或intramodular hub gene
对这些簇进行汇总,将模块相互关联并与外部样本性状关联(使用eigengene
网络)方法),并用于计算module membership
度量。挑选模块内hub
基因,这些基因可以用于生物标志物或治疗靶标。 - 它与只挑选差异基因相比,WGCNA可以从成千上万的基因中挑选出高度相关的基因的簇(模块),并将模块与外部样本性状关联,找出与样本性状高度相关的模块。然后就可以进行模块内分析。
Co-expression network(共表达网络)
共表达网络定义为无向的、加权的基因网络。这样一个网络的节点对应于基因,基因之间的边代表基因表达量的相关性,加权是将相关性的绝对值提高到幂β≥1(软阈值),加权基因共表达网络的构建以牺牲低相关性为代价,强调高相关性。具体地说,表示无符号网络的邻接关系。
Module(模块)
模块是高度互连的基因簇。在无符号共表达网络中,;模块对应于具有高度相关的基因簇。在有符号网络中,,模块对应于正相关的基因。这里的加权的网络就等于邻接矩阵。通过幂邻接转换,就强化了高相关性基因的关系,弱化了相关性基因的关系。
Connectivity(连接度)
对于每个基因,连接性(也称为度)被定义为与其他基因的连接强度之和:在共表达网络中,连接度衡量一个基因与所有其他网络基因的相关性。
Intramodular connectivity(模块内连接度)
模块内链接度衡量给定基因相对于特定模块的基因的连接或共表达程度。模块内连接度可以做为Module membership的度量。
Module eigengene E
给定模块的第一主成分,代表整个模块的基因表达谱
Module Membership(MM)
对于每个基因,我们通过将其基因表达谱与模块的Module eigengene
相关性来定义Module Membership
。测量基因i与蓝色模块Module eigengene的相关性。如果MM blue(i)
接近0,则i-th
基因不是蓝色模块的一部分。另一方面,如果MM blue(i)
接近1或-1,则它与蓝色模块基因高度相关.MM标记编码基因与蓝色模块Module eigengene
之间是正相关还是负相关.
hub gene
高度连接基因的缩写,根据定义,它是共表达网络模块内具有高连接度的基因。
Gene significance(GS)
其中表示基因的表达谱,的绝对值越高,第基因的生物学意义就越大。
基本分析流程
- 数据输入和清洗
- 网络构建和模块检测
- 量化模块和样本性状的关系
- 挑出感兴趣模块内部的基因
- 可视化TOM矩阵
- 将网络导出到外部数据进行可视化
1.数据分析的常见问题
1. 需要多少个样本?
不建议对少于15个样本的数据集尝试WGCNA。与其他分析方法一样,更多的样品通常会导致更可靠和更精确的结果。
2. 如何过滤掉探针?
探针集或基因可以通过均值、绝对中位差(MAD)或方差进行过滤,因为低表达或不变的基因通常代表噪声。用均值表达还是方差过滤是否更好尚有争议,两者都有优缺点。不建议通过差异分析过滤掉基因。
3. 除了芯片数据,是否可以用RNA-seq数据进行WGCNA分析?
- 使用(正确归一化的)RNA-seq数据与使用(正确归一化的)微阵列数据并没有什么不同。
- 只要使用相同的方式处理所有样本,无论是使用RPKM,FPKM还是简单的归一化计数,对于WGCNA分析都不会产生很大的不同。
- 如果数据来自不同批次,需要去除批次效应。
4. 挑选软阈值的问题?
如果合理的阈值(无符号或有符号的混合网络,小于15,有符号的网络,小于30)不能使无尺度拓扑网络系数高于0.8,或者或平均连接度降到100以下。可能是由于批次效应,生物学异质性(例如,由来自2个不同组织的样品组成的数据集)或条件之间的强烈变化(例如按时间序列表示)而导致的。应该仔细调查是否存在样本异质性,驱动异质性的原因以及是否应调整数据等. 如果事实证明由一个不想删除的有趣的生物学变量引起的(即调整数据),则可以根据样本数量选择适当的软阈值如下表所示。
Number of samples | Unsigned and signed hybrid networks | Signed networks |
---|---|---|
Less than 20 | 9 | 18 |
20-30 | 8 | 16 |
30-40 | 7 | 14 |
more than 40 | 6 | 12 |
2. 分析流程
安装必要的包
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("WGCNA")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("clusterProfiler","org.Mm.eg.db"))
install.packages("ggplot2")
install.packages("gplots")
2.1 数据输入、清洗、预处理:得到一个行为样本,列为基因的表达矩阵,另一个是样本对应临床特征的矩阵
# Step1. Data input and cleaning:
#========================================================
# Code chunk Step1.1
#========================================================
# Display the current working directory
rm(list = ls())
getwd();
library(dplyr)
# Load the WGCNA package
library(WGCNA);
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
#Read in the fpkm data set
library(readxl)
a <- read_excel("GSE98622_mouse-iri-master.xlsx")
class(a)
a=as.data.frame(a)
head(a[,1:6])
rownames(a)=a[,1]
table(a$type)
dat <- a%>%filter(type=="protein_coding")#筛选编码蛋白的探针
probe2sym=dat[,1:3]#提取探针和基因symbol的信息
colnames(probe2sym)=c('probe', "symbol","type" )
rownames(dat) <- dat[,1]
dat=dat[,4:ncol(dat)]
dat <- dat%>%select(starts_with("IRI"))#只关注自己感兴趣的样本例如:只关注缺血再灌注的样本
dim(dat)
write.csv(probe2sym,file = 'anno_probe2sym.csv')
# Take a quick look at what is in the data set:
names(dat);
boxplot(dat[,1:10],las=2)#画10个样本的箱线图看一下基因的表达情况
dat <- log2(dat+1)#将fpkm的数据进行log转换
boxplot(dat[,1:10],las=2)
#===========================================================
# Code chunk Step1.2 匹配数据和筛选 goodSamplesGenes
#===========================================================
## 因为WGCNA针对的是基因进行聚类,而一般我们的聚类是针对样本用hclust即可,所以这个时候需要转置
datExpr0 <- t(dat[order(apply(dat,1,mad), decreasing = T)[1:5000],])# top 5000 mad genes
datExpr <- datExpr0
#我们首先检查有太多缺失值的基因和样本
gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK
#如果最后一条语句返回TRUE,则所有基因都通过了就是OK的。如果没有,我们就从数据中删除有问题的基因和样本:
if (!gsg$allOK){
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending genes and samples from the data:
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}
#====================================================
# Code chunk Step1.3 画层次聚类树,查看是否有离群的样本
#====================================================
#接下来,我们对样本进行聚类(与稍后对基因进行聚类形成对比),以查看是否有明显的异常值
sampleTree = hclust(dist(datExpr0), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
#png("Step1-sampleClustering.png",width = 800,height = 600)
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
cex.axis = 1.5, cex.main = 2)+abline(h =75 , col = "red")
dev.off()
#====================================================
# Code chunk Step1.4 如果有离群样本就设置abline的h的值
#====================================================
# Plot a line to show the cut
#abline(h = 70000, col = "red");#从数据上看聚类的还可以,不需要剔除样本所以修改下参数“ h = ”
# Determine cluster under the line
clust = cutreeStatic(sampleTree, cutHeight = 80, minSize = 10)#不需要剔除样本所以修改下参数“ cutHeight = ”
table(clust)
# clust == 1 包含了我们需要的样本
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
#变量datExpr现在包含可用于网络分析的表达式数据。
#====================================================
# Code chunk Step1.5 画样本的层次聚类树 和trait的热图
#====================================================
#读取临床文件
datTraits=read.table("datTraits.txt",sep = "\t",header = T,check.names = F)
datTraits[1:4,1:4]
## 下面主要是为了防止临床表型与样本名字对不上
datTraits <- datTraits[match(rownames(datExpr),rownames(datTraits)),]
identical(rownames(datTraits),rownames(datExpr))
# Re-cluster samples
sampleTree2 = hclust(dist(datExpr), method = "average")
# 将样本用颜色表示:在图2所示的曲线图中,白色表示低值,红色表示高值,灰色表示缺少条目
#如果是连续性变量会是渐变色,如果是‘0’,‘1’的数据将会是红白相间
traitColors = numbers2colors(datTraits, signed = FALSE);
# Plot the sample dendrogram and the colors underneath.
sizeGrWindow(12,9)
par(cex = 0.6);
par(mar = c(0,4,2,0))
#png("Step1-Sample dendrogram and trait heatmap.png",width = 800,height = 600)
plotDendroAndColors(sampleTree2, traitColors,
groupLabels = names(datTraits),
main = "Sample dendrogram and trait heatmap",)
dev.off()
datExpr=as.data.frame(datExpr)
save(datExpr, datTraits, file = "WGCNA-01-dataInput.RData")
选取 mad 前5000的基因
2.2 GSE98622共有49个样本,提取里面为IRI的 31个样本的表达矩阵。
2.2.1 临床trait有:
-
datTraits
:为所有不同时间的IRI组, -
days
组:为IRI 组缺血48h,72h的样本 -
hours
组:为IRI 组缺血2h,4h,24h的样本 -
months
组:为IRI 组缺血6m,12m的样本 -
weeks
组:为IRI 组缺血7d,14d,28d的样本
2.3 得到样本聚类树和临床trait的热图
2.4 一步构建网络和筛选软阈值(power)没有得到合适的阈值,所有使用R包作者提供的经验阈值7
#====================================================
# Code chunk Step2.1.1 自动化一步自动网络构建和模块检测
#====================================================
# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Allow multi-threading within WGCNA. This helps speed up certain calculations.
# At present this call is necessary for the code to work.
# Any error here may be ignored but you may want to update WGCNA if you see one.
# Caution: skip this line if you run RStudio or other third-party R environments.
# See note above.
if (Sys.info()["sysname"]=="Darwin" ) {
allowWGCNAThreads(nThreads = 2)
} else{
enableWGCNAThreads(nThreads = 2)
#disableWGCNAThreads()
}
# Load the data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
#====================================================
# Code chunk Step2.1.2 选择合适的软阈值
#====================================================
# Choose a set of soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to=20, by=1))
# Call the network topology analysis function
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
sizeGrWindow(9, 5)
png("step2-beta-value.png",width = 800,height = 600)
par(mfrow = c(1,2));
cex1 = 0.9;#图里面字体大小
# Scale-free topology fit index as a function of the soft-thresholding power
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red")+
abline(h=0.70,col="red")# this line corresponds to using an R^2 cut-off of
# Mean connectivity as a function of the soft-thresholding power
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")+abline(h=100,col="red")
dev.off()
sft$powerEstimate #自动评估的软阈值 默认的 R^2 cut-off是0.85
#====================================================
# Code chunk Step2.1.3 #一步法完成网络构建
#====================================================
net = blockwiseModules(datExpr, power = 7,#power需要进行修改
TOMType = "unsigned", minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = FALSE,
saveTOMFileBase = "WGCNA TOM",
verbose = 3)
#参数mergeCutHeight是模块合并的阈值。net$Colors包含模块赋值,net$MES包含模块的模块特征(module eigengenes of the modules)。
table(net$colors)
#并表明有16个模块,按照大小从大到小的顺序标记为0到15。标签0是为所有模块之外的基因保留的
#用于模块标识的层次聚类树形图(树)在net$Dendrogram[[1]];$中返回。
#====================================================
# Code chunk Step2.1.4 画genes-modules的图
#====================================================
# open a graphics window
sizeGrWindow(12, 9)
# Convert labels to colors for plotting
mergedColors = labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
png("Step2-genes-modules.png",width = 800,height = 500)
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
dev.off()
#====================================================
# Code chunk Step2.1.5 #保存MEs, moduleLabels, moduleColors, geneTree
#====================================================
moduleLabels = net$colors
moduleColors = labels2colors(net$colors)#模块里每个基因的颜色
MEs = net$MEs;
geneTree = net$dendrograms[[1]];#用于模块标识的层次聚类树形图(树)保存在geneTree里面
save(MEs, moduleLabels, moduleColors, geneTree,
file = "WGCNA-02-networkConstruction-auto.RData")
2.5 模块与临床trait的关系
2.6 挑选与感兴趣临床特征的模块
2.7 对模块内的基因的进行GO富集分析
2.8 将挑选出的基因在cytoscape进行可视化,通过cytohubba插件进行可视化
3.结果
3.1 样本聚类树和临床特征heatmap
解读:基于样本的欧几里德距离以及性状 的颜色指示(0=白色,1=红色)的聚类树状图(如果是连续性变量:表达量从低到高,颜色从白色过渡到红色,灰色代表缺失值。)。顺序和聚类树中的顺序一致,编码特征值的颜色与分支不一致,这表明trait=0的样本与trait=1的样本不是“全局不同的”。
3.2 软阈值的挑选:挑选软阈值是为了构建无尺度network,使node的Mean Connectivity接近于0,没有得到合适的软阈值,所以使用经验阈值7。
解读:左图:各种软阈值(power)的网络拓扑分析。左面板显示无尺度拟合指数(y轴)作为软阈值(power)(x轴)的函数。右侧面板显示作为软阈值(power)(x轴)函数的平均连接性(度数,y轴)。
3.3 模块的构建
# Step3. Relating modules to external clinical traits and identifying important genes
#====================================================
# Code chunk Step3.1
#====================================================
# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the expression and trait data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-auto.RData");
lnames
#====================================================
# Code chunk Step3.2 计算MEs
#====================================================
# Define numbers of genes and samples
nGenes = ncol(datExpr);#定义基因和样本的数量
nSamples = nrow(datExpr);
# Recalculate MEs with color labels
MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
MEs = orderMEs(MEs0)#不同颜色的模块的ME值矩 (样本vs模块)
moduleTraitCor = cor(MEs, datTraits, use = "p");#计算模块与临床数据的相关性 行为样本,列为ME与临床特征的关系
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples);
#====================================================
# Code chunk Step3.3 画模块和临床trait的关系图
#====================================================
sizeGrWindow(15,20)
# Will display correlations and their p-values
textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "");
dim(textMatrix) = dim(moduleTraitCor)
png("step3-Module-trait-relationships.png",width = 1500,height = 1200,res = 130)
par(mar= c(5,8, 2, 2));
# Display the correlation values within a heatmap plot
labeledHeatmap(Matrix = moduleTraitCor,
xLabels = names(datTraits),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),#WGCNA提醒greenWhiteRed不适合红绿色盲,建议用blueWhiteRed
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.5,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
dev.off()
#分析确定了几个重要的模块-性状关联。我们将把重点放在hour这一感兴趣的特征上
#====================================================
# Code chunk Step3.4 基因与性状和重要模块的关系:GS和MM
#====================================================
#GS: as(the absolute value of) the correlation between the gene and the trait
#MM: as the correlation of the module eigengene and the gene expression profile. This allows us to quantify the similarity of all genes on the array to every module.
# Define variable hour containing the hour column of datTrait
months = as.data.frame(datTraits$months);
names(months) = "months"
# names (colors) of the modules
modNames = substring(names(MEs), 3)
geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"));
MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples));
#在列名上加MM,p.MM
names(geneModuleMembership) = paste("MM", modNames, sep="");
names(MMPvalue) = paste("p.MM", modNames, sep="");
geneTraitSignificance = as.data.frame(cor(datExpr, months, use = "p"));#修改临床特征hour
GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples));
names(geneTraitSignificance) = paste("GS.", names(months), sep="");#修改临床特征hour
names(GSPvalue) = paste("p.GS.", names(months), sep="");#修改临床特征hour
#====================================================
# Code chunk Step3.5 模块内分析:作模块membership和genesignificance的相关图
#====================================================
module = "purple"
column = match(module, modNames);
moduleGenes = moduleColors==module;
sizeGrWindow(7, 7);
png("step3-Module_membership-gene_significance.png",width = 800,height = 600)
par(mfrow = c(1,1));
verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for months",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
dev.off()
#图表如所示。显然,GS和MM是高度相关的,
#说明与某个性状高度显著相关的基因通常也是与该性状相关的模块中最重要的(中心)元素。
#====================================================
# Code chunk Step3.5.1 计算模块内连接度
#====================================================
adjacency = adjacency(datExpr, power =7);#计算一个邻接矩阵
Alldegrees=intramodularConnectivity(adjacency, moduleColors)
#a data frame with 4 columns giving the totalconnectivity(kTotal ),
#intramodular connectivity(kWithon ), extra-modular connectivity(kOut ),
#and the difference of the intraandextra-modular connectivities for all genes(kDiff );
#otherwise a vector of intramodular connectivities,
class(Alldegrees)
module = "purple"; # Select module
probes = names(datExpr) # Select module probes
inModule = (moduleColors==module);
modProbes = probes[inModule];
length(modProbes)
KIM_module=Alldegrees[modProbes,]
#====================================================
# Code chunk Step3.5.2 展示模块的热图和eigengene
#====================================================
#我们现在创建一个解释模块(heatmap)和相应module eigengene(barplot)之间关系的图:
sizeGrWindow(8,7);
which.module="purple"
ME=MEs[, paste("ME",which.module, sep="")]
par(mfrow=c(2,1), mar=c(0.3, 5.5, 3, 2))
plotMat(t(scale(datExpr[,moduleColors==which.module ]) ),
nrgcols=30,rlabels=F,rcols=which.module,
main=which.module, cex.main=2)
par(mar=c(5, 4.2, 0, 0.7))
barplot(ME, col=which.module, main="", cex.main=2,
ylab="eigengene expression",xlab="array sample")
#最上面一行显示了粉色模块基因(行)在微阵列(列)中的位置。下一行显示与相同微阵列样本相对应的模块特征基因表达值(y轴)。注意,ME在许多模块基因表达不足的阵列中呈现低值(热图中为绿色)。ME对于很多模块基因过度表达的阵列具有很高的值(热图中为红色)。ME可以被认为是该模块最具代表性的基因表达谱。
#====================================================
# Code chunk Step3.6 探针名转为基因名
#====================================================
names(datExpr)[1:10]
tail(names(datExpr)[moduleColors=="purple"])
annot = read.csv(file = "anno_probe2sym.csv",row.names = 1);
dim(annot)
names(annot)
probes = names(datExpr)
probes2annot = match(probes, annot$probe)
# The following is the number or probes without annotation:
sum(is.na(probes2annot))
# Should return 0.
#====================================================
# Code chunk Step3.7 创建一个数据框,其中包含所有探针的以下信息:
#探针ID、基因符号、module color, gene significance for weight, and module membership and p-values in all modules.
#模块将按其' hour '重要性排序,最重要的模块位于左侧。
#====================================================
# Create the starting data frame
geneInfo0 = data.frame(probe = probes,#需要自己进行修改
geneSymbol = annot$symbol[probes2annot],
type = annot$type[probes2annot],
moduleColor = moduleColors,
geneTraitSignificance,
GSPvalue)
# Order modules by their significance for ‘hour’
modOrder = order(-abs(cor(MEs,months, use = "p"))); # 修改特征参数‘hour’
# Add module membership information in the chosen order
for (mod in 1:ncol(geneModuleMembership))
{
oldNames = names(geneInfo0)
geneInfo0 = data.frame(geneInfo0, geneModuleMembership[, modOrder[mod]],
MMPvalue[, modOrder[mod]]);
names(geneInfo0) = c(oldNames, paste("MM.", modNames[modOrder[mod]], sep=""),
paste("p.MM.", modNames[modOrder[mod]], sep=""))
}
# Order the genes in the geneInfo variable first by module color, then by geneTraitSignificance
geneOrder = order(geneInfo0$moduleColor, -abs(geneInfo0$GS.months)); # 修改特征参数‘hour’
geneInfo = geneInfo0[geneOrder, ]
write.csv(geneInfo, file = "geneInfo.csv")
解读:挑选出来的基因聚类树状图,聚类时的距离为1-TOM值,基于拓扑重叠的不同,以及指定的模块颜色。
3.4 量化模块和临床性状的关联
- 模块与性状关系图。每行对应一个模块特征基因,每列对应一个性状。每个单元格包含相应的相关性和p值。根据颜色图例通过相关性对表进行颜色编码。
- 通过将基因显着性GS定义为基因和性状之间的相关性(的绝对值),我们可以量化单个基因与我们感兴趣的性状(权重)的关联。对于每个模块,我们还将模块成员数MM的定量度量定义为模块本征基因与基因表达谱的相关性。这使我们能够量化阵列上所有基因与每个模块的相似性。
3.5 将表观数据纳入ME,统一制作ME相关性的热图: 选择自己感兴趣的临床特征
解读:可视化的eigengene网络表示模块之间的关系和临床性状。模块邻接关系的热图,红色表示高邻接(正相关),蓝色表示低邻接(负相关)。
3.6 模块内分析:选择模块,作Module membership和 Gene significance的相关图
解读:从图中可看出 GS和MM是高度相关的,说明与某个性状高度显著相关的基因通常也是与该性状相关的模块中最重要的(中心)元素。g感兴趣模块中权重对Module membership(MM)的Gene significance(GS)散点图。在这个模块中,GS和MM之间存在着极显著的相关性。
3.7 对模块内的基因进行GO分析
#Step4.Interfacing network analysis with other data such as GO and KEGG
#====================================================
# Code chunk Step4.1 主要是关心具体某个模块内部的基因
#====================================================
# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
if (Sys.info()["sysname"]=="Darwin" ) {
allowWGCNAThreads(nThreads = 4)
} else{
enableWGCNAThreads(nThreads = 2)
#disableWGCNAThreads()
}
# Load the expression and trait data saved in the first part
lnames = load(file ="WGCNA-01-dataInput.RData");#加载进来这里的我的表达矩阵变成了matrix,将其转为data.frame 才不会报错
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-auto.RData");
lnames
# Select module
module = "purple";
# Select module probes
probes = colnames(datExpr) #我们例子里面的probe就是基因
inModule = (moduleColors==module);
table(inModule)
modProbes = probes[inModule]; #模块内的基因已经挑了出来,可以用Y叔的包画图了
head(modProbes)
annot = read.csv(file = "anno_probe2sym.csv",row.names = 1);
modGenes = annot$symbol[match(modProbes, annot$probe)];
#====================================================
# Code chunk Step4.2 GO分析,kEGG分析
#====================================================
library(clusterProfiler)
library(org.Mm.eg.db)
library(ggplot2)
modGenes=as.data.frame(modGenes)
mod_entrez= mapIds(x = org.Mm.eg.db,
keys = modGenes$modGenes,
keytype = "SYMBOL",
column = "ENTREZID")
length(mod_entrez)
mod_entrez =na.omit(mod_entrez)#去除没有ENTREZ id 的基因,
length(mod_entrez)
#对基因集进行富集分析。给定一个基因载体,该函数将在FDR控制后返回富集GO分类。
go <- enrichGO(gene = mod_entrez, #a vector of entrez gene id
keyType = "ENTREZID",#输入基因的型
OrgDb = "org.Mm.eg.db", #组织数据库,bioconductor里面有人,鼠等
ont="all",
pvalueCutoff = 0.5,
qvalueCutoff = 0.5,
readable = TRUE)#whether mapping gene ID to gene Name
par(mar= c(3,4, 2, 2));
png("GO_all.png",width = 1500,height = 1200,res = 130)# 尝试随便命名
dotplot(go, split="ONTOLOGY") + facet_grid(ONTOLOGY~., scale="free")
dev.off()
#### 查看得到的结果 这里好像有个参数可以直接返回,等有空了去仔细看看这个R包
go_result=go@result
write.csv(go_result,file = 'go_all_result.csv')
# Step5.Network visualization using WGCNA functions
#====================================================
# Code chunk Step5.1 可视化 TOM矩阵,WGCNA的标准配图
#====================================================
# Display the current working directory
rm(list = ls())
getwd();
# Load the WGCNA package
library(WGCNA)
library(gplots)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
if (Sys.info()["sysname"]=="Darwin" ) {
allowWGCNAThreads(nThreads = 4)
} else{
enableWGCNAThreads(nThreads = 2)
#disableWGCNAThreads()
}
# Load the expression and trait data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-auto.RData");
lnames
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
# Calculate topological overlap anew: this could be done more efficiently by saving the TOM
# calculated during module detection, but let us do it again here.
load(file="TOMsimilarityFromExpr.Rdata")
if (F) {
TOM = TOMsimilarityFromExpr(datExpr, power = 7);#前面的power为21
dissTOM = 1-TOM;#前面估计的power为7
save(TOM,file="TOMsimilarityFromExpr.Rdata")
# Transform dissTOM with a power to make moderately strong connections more visible in the heatmap
}
plotTOM = dissTOM^7;
# Set diagonal to NA for a nicer plot
diag(plotTOM) = NA;
# Call the plot function
if (T) {
sizeGrWindow(9,9)
png("step5-Network-heatmap.png",width = 800,height = 600)
myheatcol = colorpanel(250,'red',"orange",'lemonchiffon')
TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes", col = myheatcol)
#TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes")
dev.off()
}
#使用热图将基因网络可视化。热图描绘了分析中所有基因之间的拓扑重叠矩阵(TOM)。
#浅色表示低重叠,逐渐变深的红色表示高重叠。沿对角线的深色块是模块。
#左侧和顶部还显示了基因树状图和模块分配
#====================================================
# Code chunk 3
#====================================================
nSelect = 400
# For reproducibility, we set the random seed
set.seed(10);
select = sample(nGenes, size = nSelect);
selectTOM = dissTOM[select, select];
# There's no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
selectTree = hclust(as.dist(selectTOM), method = "average")
selectColors = moduleColors[select];
# Open a graphical window
sizeGrWindow(9,9)
# Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
# the color palette; setting the diagonal to NA also improves the clarity of the plot
plotDiss = selectTOM^7;
diag(plotDiss) = NA;
myheatcol = colorpanel(250,'red',"orange",'lemonchiffon')
TOMplot(plotDiss, selectTree, selectColors, main = "Network heatmap plot, selected genes",col = myheatcol)
#====================================================
# Code chunk Step5.3 将性状数据纳入ME,统一制作ME相关性的热图: 选择自己感兴趣的临床特征
#====================================================
#重新计算模块MEs
# Recalculate module eigengenes
MEs = moduleEigengenes(datExpr, moduleColors)$eigengenes
# Isolate 'hour' from the clinical traits
months = as.data.frame(datTraits$months);
names(months) = "months"
# hour加⼊入MEs成为MET
# Add the hour to existing module eigengenes
MET = orderMEs(cbind(MEs, months))
# Plot the relationships among the eigengenes and the trait
sizeGrWindow(5,10);
par(cex = 0.9)
png("step5-Eigengene-dendrogram.png",width = 800,height = 600)
par(margin(3,6,2,2))
plotEigengeneNetworks(MET, "Eigengene dendrogram and adjacency heatmap" ,
marDendro = c(0,4,1,2),
marHeatmap = c(5,5,1,2) ,
cex.lab = 0.8, xLabelsAngle= 90)
dev.off()
#该函数生成ME和性状的树状图,以及它们之间关系的热图。显示了 eigengenes的层次聚类树图由dissimilarity of eigengenes EI构成,Ej由1−cor(Ei,Ej)给出,
#下面的热图显示eigengenes的邻接值,AIj=(1+COR(Ei,Ej))/2。
#要拆分树状图和热图,我们可以使用以下代码
#====================================================
# Code chunk Step5.5 拆分树状图和热图
#====================================================
# Plot the dendrogram
sizeGrWindow(6,6);
par(cex = 1.0)
# 模块的进化树
#png("step5-Eigengene-dendrogram-hclust.png",width = 800,height = 600)
plotEigengeneNetworks(MET, "Eigengene dendrogram", marDendro = c(0,4,2,0),
plotHeatmaps = FALSE)
dev.off()
# Plot the heatmap matrix (note: this plot will overwrite the dendrogram plot)
par(cex = 1.0)
par(margin(3,6,2,2))
# 性状与模块热图
#png("step5-Eigengene-adjacency-heatmap.png",width = 800,height = 600)
plotEigengeneNetworks(MET, "Eigengene adjacency heatmap", marHeatmap = c(5,4,2,2),
plotDendrograms = FALSE, xLabelsAngle = 90)
dev.off()
#可视化的特征基因网络表示模块和临床性状之间的关系。
对基因的描述一般从三个层面进行:
- Cellular component解释的是基因存在在哪里,在细胞质还是在细胞核?如果存在细胞质那在哪个细胞器上?如果是在线粒体中那是存在线粒体膜上还是在线粒体的基质当中?这些信息都叫Cellular component。
- Biological process是在说明该基因参与了哪些生物学过程,比如,它参与了rRNA的加工或参与了DNA的复制,这些信息都叫Biological process
- Molecular function在讲该基因在分子层面的功能是什么?它是催化什么反应的?立足于这三个方面,我们将得到基因的注释信息。
3.8 拓扑重叠矩阵的热图
- 使用热图将基因网络可视化。热图描绘了分析中所有基因之间的拓扑重叠矩阵(TOM)。浅色表示低重叠,逐渐变深的红色表示高重叠。沿对角线的深色块是模块。左侧和顶部还显示了基因树状图和模块分配
# Step6.Export of networks to external software
#====================================================
# Code chunk Step6.1
#====================================================
# Display the current working directory
getwd();
rm(list = ls())
# Load the WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
enableWGCNAThreads()
# Load the expression and trait data saved in the first part
lnames = load(file = "WGCNA-01-dataInput.RData");
#The variable lnames contains the names of loaded variables.
lnames
# Load network data saved in the second part.
lnames = load(file = "WGCNA-02-networkConstruction-auto.RData");
lnames
#====================================================
# Code chunk Step6.2
#====================================================
# Recalculate topological overlap if needed
load(file="TOMsimilarityFromExpr.Rdata")
#TOM = TOMsimilarityFromExpr(datExpr, power = 7);#前面的power为7
# Read in the annotation file
annot = read.csv(file = "anno_probe2sym.csv",row.names = 1);
# Select modules
module = c("purple");###pink 对应 hour
# Select module probes
datExpr=as.data.frame(datExpr)
probes = names(datExpr)
inModule = (moduleColors==module);
table(moduleColors)
table(inModule)
modProbes = probes[inModule];
modGenes = annot$symbol[match(modProbes, annot$probe)];
# Select the corresponding Topological Overlap
modTOM = TOM[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into edge and node list files Cytoscape can read
cyt = exportNetworkToCytoscape(modTOM,
edgeFile = paste("CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""),
nodeFile = paste("CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""),
weighted = TRUE,
threshold = 0.02,
nodeNames = modProbes,
altNodeNames = modGenes,
nodeAttr = moduleColors[inModule])
总结:
通过对数据集进行WGCNA进行分析,筛选出了一些想要的hub gene,可以对这些基因检测,做相关的验证的实验。如果是肿瘤,可以去TCGA数据库做预后分析等。
在对hub基因进行挑选的时候,也可以通过模块内具有高度Connectivity的进行筛选。