使用TCGAbiolinks从GDC Data Portal上下载
query = GDCquery(project = "TCGA-LAML", legacy = FALSE, experimental.strategy = "RNA-Seq", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts")
workflow.type 有三种类型分别为“HTSeq - Counts”,“HTSeq - FPKM”,“HTSeq - FPKM-UQ”
这里自已通常下载count文件,TMM归一化
必须先下载,用下面的命令
GDCdownload(query)
会在当前目录下生成下面的一个文件夹,数据就在里面
否则下面的代码会报错,如下:(将下载的内容读进去)
GDCprepare: Reads the data downloaded and prepare it into an R object
dataAssy = GDCprepare(query, summarizedExperiment = F)
下载之后会这样,如下:(将下载的内容读进去)
下面就有两条路可以走了
1.
dataAssy = GDCprepare(query, summarizedExperiment = F)
rownames(dataAssy) = dataAssy[,1]
dataAssy = dataAssy[,-1]
colnames(dataAssy) = str_match(colnames(dataAssy), "(TCGA-[^-]*-[^-]*-[^-]*)")[,2]
dataAssyout = cbind(rownames(dataAssy), dataAssy)
colnames(dataAssyout)[1] = "Symbol"
dataAssyout$Symbol=as.character(dataAssyout$Symbol)
str(dataAssyout)
tt=tail(dataAssyout)
tt$Symbol=as.character(tt$Symbol)
#for(i in 1:nrow(dataAssyout)){
# dataAssyout$Symbol[i]=str_split(dataAssyout$Symbol[i],"\\.")[[1]][1]
#}
my_function=function(x) {x=str_split(x,"\\.")[[1]][1]
}
tt$Symbol=apply(data.frame(tt$Symbol),1,my_function)
dataAssyout$Symbol=apply(data.frame(dataAssyout$Symbol),1,my_function)
head(dataAssyout)
##去掉前五行
dataAssyout2=dataAssyout[-c(1:5),]
head(dataAssyout2)
write.table(dataAssyout2, "RNA_count_data.txt", row.names = F, sep = "\t", quote = F)
2.
dataAssy2 = GDCprepare(query, summarizedExperiment = T)
expMatrix <- TCGAanalyze_Preprocessing(dataAssy2 )
write.csv(expMatrix, "TCGA_count.csv", quote=F, row.names=T)
比对之后,二者count一摸一样
归一化 TMM
By running gdcVoomNormalization()
function, raw counts data would be normalized by TMM method implemented in edgeR(Robinson, McCarthy, and Smyth 2010) and further transformed by the voom method provided in limma(Ritchie et al. 2015). Low expression genes (logcpm < 1 in more than half of the samples) will be filtered out by default. All the genes can be kept by setting filter=TRUE
in the gdcVoomNormalization()
.
library(GDCRNATools)
####### Normalization of RNAseq data #######
rnaExpr <- gdcVoomNormalization(counts = expMatrix, filter = FALSE)