版权所有,转载必究!
本文作者:任红雷, 联系邮箱: renhongleiz@126.com
本文整理自个人汇报的PPT
首先,本文将结合代码讲解常用的Affymetrix 芯片数据处理流程
以GSE11787为例,数据为关于副猪嗜血杆菌对猪炎症发生影响的数据,请下载数据,并将其解压到任意的文件夹,以供之后使用:
http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE11787&format=file
首先介绍一下关于基因表达芯片的背景。(最后附上流程代码)
1.提纲
2.中心法则(可自行百度)
3.DNA 芯片是什么,作用如何?
5.Affymetrix 芯片
6.芯片上的信息
7.PM和MM
8.芯片扫描结果
9..CEL文件格式
10.intensity(信号强度)的实际含义
11..CEL文件以及.CDF文件
12.数据处理流程
13.拓展学习书籍列表
(本教程内容整理自Bioconductor Case Studies)
1.A (very) short introduction to R
https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf
2.An Introduction to R
https://cran.r-project.org/doc/manuals/R-intro.pdf
3.Bioconductor Case Studies
lots of bioconductor workflows with source codes
链接: http://pan.baidu.com/s/1qYBtvIk
密码: eq85
Affymetrix 芯片数据处理流程的源代码
1.Read data
##1.ReadData##
#add bioconductor source
source("https://bioconductor.org/biocLite.R")
#install affy package from bioconductor
biocLite("affy")
#load affy package
library("affy")
#set the directory that you are working with,this can be replaced by your own CEL files path
setwd("/Users/Ren/Documents/Rcode/GSE11787_RAW")
#set the .CEL files path,this can be replaced by your own CEL files path
celpath="/Users/Ren/Documents/Rcode/GSE11787_RAW/"
#read .CEL files from directory of celfile.path
data = ReadAffy(celfile.path =celpath)
#replace the sample names of the data by trim off the ".CEL" suffix
sampleNames(data) = sub("\\.CEL$", "", sampleNames(data))
#read the phenoData of the CEL file.The "type.csv" file is consitituded with .csv format after you comprehensed the samples grouping on the GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11787
samplestype <- read.csv(paste(celpath,"type.csv",sep = ""),header = TRUE,row.names = 1,na.strings = "NA",sep = ',',stringsAsFactors = F,as.is = !default.stringsAsFactors())
#replace the rownames of the samplestype by its sampleID
rownames(samplestype) = samplestype$sampleID
#match the rownames of your phenoData to the samplenames of experiments data,which will return the matched index
mt = match(rownames(samplestype), sampleNames(data))
#give a completed column description of the phenoData
vmd = data.frame(labelDescription = c("Sample ID", "Samples type: Haemophilus Parasuis infected or control"))
#make the matched sampletype rows and varMetadata to your data's phenoData
phenoData(data) = new("AnnotatedDataFrame", data = samplestype[mt, ], varMetadata = vmd)
#erase samples which has no type
data = data[,!is.na(data$type)]
其中type.csv的文件形式如下:
2.Quality Control
##2.Quality Control##
#install the essential packages from Bioconductor
source("https://bioconductor.org/biocLite.R")
biocLite(c("affyQCReport","simpleaffy"))
#load the pakages of affyQCReport and simpleaffy
library("affyQCReport")
library("simpleaffy")
#Execute the Quality Control
saqc = qc(data)
#plot the quality control result
plot(saqc)
QC 之后的结果,如下图:
3.Data Normalization
##3.Data Normalization##
#packaged affyPLM provides another set of diagnostics that can be used to help assess array quality
source("https://bioconductor.org/biocLite.R")
biocLite(c("affyPLM"))
library("affyPLM")
#fit the basic probe-level model
dataPLM = fitPLM(data)
#plot normalized unscaled standard error (NUSE)
boxplot(dataPLM, main="NUSE", outline = FALSE, col="lightblue", las=3, whisklty=0, staplelty=0)
#plot relative log expression (RLE)
Mbox(dataPLM, main="RLE", outline = FALSE, col="mistyrose", las=3, whisklty=0, staplelty=0)
#if some sample was significantly elevated or more spread out,you can remove it as bad arrays
#here we don not have bad array ,the code next line is just a example to remove the sample "GSM298263",to execute this ,you can remove the sharp at the head of next line
#badArray = data[,-match(c("GSM298263"), sampleNames(data)) ]
normalized unscaled standard error (NUSE) 的各样本结果。
relative log expression (RLE) 的各样本结果。
4.Data preprocessing
##4.Data preprocessing##
#load affy package
library("affy")
#rma preprocessing:which concluded 1.background correction,2.between array normalization 3.reporter summarization
datarma = rma(data)
source("https://bioconductor.org/biocLite.R")
#install the porcines' db for get its specices annotation of the probe set
biocLite(c("porcine.db"))
#load the package of porcine.db
library("porcine.db")
#to filter out probe sets with no Entrez Gene identifiers and Affymetrix control probes
datafiltered = nsFilter(datarma, remove.dupEntrez=FALSE, var.cutof =0.5)$eset
#t-tests for rows of a matrix, intended to be speed efficient
#The function computes the t-statistic, the difference of the group means between the two disease groups, and the corresponding p-value by using the t-distribution.
datatt = rowttests(datafiltered, "type")
#draw the volcano plot using x-axis:dm(difference of the group means),y-axis:group log(p-value)
lod = -log10(datatt$p.value)
plot(datatt$dm, lod, pch=".", xlab="log-ratio",ylab=expression(-log[10]~p))
#line indicates an untransformed p-value of 0.01, so points above it will be significant
abline(h=2)
source("https://bioconductor.org/biocLite.R")
#install the porcine's db for get its specices annotation of the probe set
biocLite(c("limma"))
library("limma")
#make the design of the experiment as a matrix with its type
design = model.matrix(~datafiltered$type)
datalim = lmFit(datafiltered, design)
#When there are few replicates, the variance is not well estimated and the t-statistic can perform poorly.
#It can be solved by a improved method called empirical Bayes to give a sensible estimation of variance and t-statistic
#When sample sizes are moderate or large, say ten or more in each group, there is generally no advantage (but also no disadvantage) to using the Bayesian approach.
dataeb = eBayes(datalim)
火山图的结果(Volcano plot),黑线之上的基因代表p值显著的基因
5.GO analysis
##GO analysis##
source("https://bioconductor.org/biocLite.R")
#install the porcine's db for get its specices annotation of the probe set
biocLite(c("topGO"))
library(topGO)
#multiple testing problem corrected by multtest package
#Alternatively,the function topTable in the limma package provides multiple testing adjustment methods, including Benjamini and Hochberg’s false discovery rate (FDR), simple Bonferroni correction, and several others.
#for more information you can access:http://www.jianshu.com/p/9e97e9b351fd,and http://www.jianshu.com/p/a262cf3d18b9
#get the top 10 differential expressed gene by multiple testing correction of Benjamini-Hochberg (FDR) method
tabofallgene = topTable(dataeb, coef=2, adjust.method="BH", n=length(datatt[,1]))
#get the p-value of all genes named with the probe set name
geneList=setNames(tabofallgene[,5],rownames(tabofallgene))
annotation_db_name=paste(annotation(datafiltered),".db",sep="")
#load the annotation db package
library(package = annotation_db_name, character.only = TRUE)
#method to extract top differential expressed gene
topDiffGenes <- function(allScore) {
return(allScore < 0.01)
}
#get the count of the differential expressed gene set
sum(topDiffGenes(geneList))
#new a topGOdata type object and use biology process(a gene ontology category),all gene is geneList, differential expressed gene is topDiffGenes
sampleGOdata <- new("topGOdata", description = "Simple session", ontology = "BP",allGenes = geneList, geneSel = topDiffGenes,nodeSize = 10,annot = annFUN.db, affyLib = annotation_db_name)
#Performing the enrichment tests
#Fisher’s exact test which is based on gene counts
resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher")
#Kolmogorov-Smirnov like test which computes enrichment based on gene scores.
resultKS <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks")
resultKS.elim <- runTest(sampleGOdata, algorithm = "elim", statistic = "ks")
#make all GO enrichment result to a table called allRes
allRes <- GenTable(sampleGOdata, classicFisher = resultFisher,classicKS = resultKS, elimKS = resultKS.elim,orderBy = "elimKS", ranksOf = "classicFisher", topNodes = 10)
#draw the GO tree graph,with a depth of 10
tree_depth=10
showSigOfNodes(sampleGOdata, score(resultKS.elim), firstSigNodes = tree_depth, useInfo = 'all')
GO terms的DAG图:
6.可选步骤.绘制样本集的热图(heatmap)和level plot
##prestep:1.read data from .CEL files and 2.Quality control
##generate sample's level plot and heatmap##
#calculate the distance of a n*n matrix,and set the diagonal to zero
dd = dist2(log2(exprs(data)),diagonal=0)
#Hierarchical clustering of dd
dd.row <- as.dendrogram(hclust(as.dist(dd)))
#x or y axis sample order index in dd matrix
row.ord <- order.dendrogram(dd.row)
source("https://bioconductor.org/biocLite.R")
biocLite(c("latticeExtra"))
library("latticeExtra")
#add lengend
legend = list(top=list(fun=dendrogramGrob, args=list(x=dd.row, side="top")))
#give a level plot
lp = levelplot(dd[row.ord, row.ord],scales=list(x=list(rot=90)), xlab="", ylab="", legend=legend)
lp
install.packages(c("gplots","RColorBrewer"))
library(gplots)
library(RColorBrewer)
# creates a own color palette from red to green
my_palette <- colorRampPalette(c( "green","yellow", "red"))(n = 299)
gplots:::heatmap.2(dd, Rowv =FALSE,Colv=dd.row, col = my_palette, tracecol=NA,density.info="none",cexRow=0.4,cexCol=0.4)
热图如下:
level plot如下: