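In tumor research, the correlation between two or more genes can be computed with GEPIA2, or by downloading the data and processing it in R. Adding the same correlation in cell lines makes the picture more complete. This post shows how to download the CCLE sequencing data manually and compute the correlation between any two genes.
1. Data download
CCLE has collected RNA-seq data for more than 1,000 tumor cell lines, so it is worth a preliminary look here before settling on the genes to study. Start by downloading the TPM data. CCLE URL: https://portals.broadinstitute.org/ccle/data
2. Data cleaning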
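rm(list = ls())
library(dplyr)    # %>%, select(), inner_join()
library(stringr)  # str_detect()
# read the TPM matrix and drop the second column (transcript IDs)
rt <- data.table::fread("CCLE_RNAseq_rsem_genes_tpm_20180929.txt", data.table = FALSE) %>%
  select(-2)
# annotation file (maps gene_id to gene name)
ccle_ann <- data.table::fread("ccle_anno.txt", data.table = FALSE)
# attach the gene names, then drop the first column (gene_id)
rt <- inner_join(ccle_ann, rt, by = "gene_id") %>%
  select(-1)
# keep the kidney cancer cell lines
rt1 <- rt[, str_detect(colnames(rt), "KIDNEY")]
rt1 <- cbind(name = rt$name, rt1)
# log2-transform the TPM values
rt1[, -1] <- log2(rt1[, -1] + 1)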
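To pick a tissue other than kidney, it helps to see which tissue labels are present in the column names first. The following is a minimal sketch on a made-up table, assuming the sample columns are named CELLLINE_TISSUE; the gene names and values are invented for illustration and are not CCLE data.

```r
# Toy expression table; gene names and TPM values are made up for illustration.
toy <- data.frame(
  name = c("GENE_A", "GENE_B"),
  `786O_KIDNEY` = c(0, 3),
  `A549_LUNG` = c(7, 1),
  `ACHN_KIDNEY` = c(15, 0),
  check.names = FALSE
)

# Tissue label = everything after the first underscore in the column name
# (assumes the CELLLINE_TISSUE naming pattern).
samples <- setdiff(colnames(toy), "name")
tissue <- sub("^[^_]*_", "", samples)
tissue_counts <- table(tissue)

# Keep the kidney columns and apply log2(TPM + 1), as in the cleaning step.
kidney <- toy[, c("name", samples[tissue == "KIDNEY"])]
kidney[, -1] <- log2(kidney[, -1] + 1)
```

Printing tissue_counts shows how many cell lines each tissue has, which is also a quick way to check the n that ends up in the plot title later.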
3. Correlation analysis
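gene1Name <- "NCK1"
gene2Name <- "NCK1-AS1"
# expression of each gene across the selected cell lines
x <- as.numeric(rt1[rt1$name == gene1Name, -1])
y <- as.numeric(rt1[rt1$name == gene2Name, -1])
# linear fit (used for the trend line) and Pearson correlation test
z <- lm(y ~ x)
corT <- cor.test(x, y)
cor <- round(corT$estimate, 3)
pvalue <- corT$p.value
# format the p-value: scientific notation when it is very small
if (pvalue < 0.001) {
  pval <- format(signif(pvalue, 4), scientific = TRUE)
} else {
  pval <- round(pvalue, 3)
}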
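The same steps can be wrapped in a small function so that any gene pair can be checked quickly. This is only a sketch: gene_cor is a hypothetical helper, not part of any package, and the method argument is simply passed on to cor.test, so "spearman" can be used when a rank-based correlation seems more appropriate. The toy table below is made up.

```r
# Hypothetical helper: correlate two genes in a table shaped like rt1
# (first column `name`, remaining columns = cell lines).
gene_cor <- function(expr, gene1, gene2, method = "pearson") {
  # each gene must match exactly one row
  if (sum(expr$name == gene1) != 1 || sum(expr$name == gene2) != 1) {
    stop("each gene must match exactly one row of the expression table")
  }
  x <- as.numeric(expr[expr$name == gene1, -1])
  y <- as.numeric(expr[expr$name == gene2, -1])
  res <- cor.test(x, y, method = method)
  list(estimate = unname(res$estimate), p.value = res$p.value, n = length(x))
}

# Made-up data: G2 is the square of G1, so the relation is monotone but not linear.
toy <- data.frame(name = c("G1", "G2"),
                  c1 = c(1, 1), c2 = c(2, 4), c3 = c(3, 9),
                  c4 = c(4, 16), c5 = c(5, 25), c6 = c(6, 36))

pe <- gene_cor(toy, "G1", "G2")                       # Pearson (default)
sp <- gene_cor(toy, "G1", "G2", method = "spearman")  # rank-based
```

On the real data the call would look like gene_cor(rt1, "NCK1", "NCK1-AS1").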
4. Scatter plot
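# scatter plot of the two genes; n in the title is taken from the data
plot(x, y, type = "p", pch = 16, col = "blue",
     main = paste0("CCLE cell lines (n=", length(x), ")"),
     cex = 1, cex.lab = 1, cex.main = 1, cex.axis = 1,
     xlab = paste(gene1Name, "expression log2(TPM+1)"),
     ylab = paste(gene2Name, "expression log2(TPM+1)"))
# fitted regression line
lines(x, fitted(z), col = 2)
# annotation placed at data coordinates (0.5, 2); adjust to the axis ranges
text(0.5, 2, paste("Cor=", cor, "\n p-value=", pval, sep = ""))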
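To keep the figure, the plot can be written to a file, and the annotation can be placed with legend() so that no coordinates have to be picked by hand. A minimal sketch on simulated data follows; the values, the file location (a temporary file) and the figure size are assumptions for illustration only.

```r
# Simulated data standing in for two log2(TPM+1) expression vectors.
set.seed(1)
x <- runif(20, 0, 5)
y <- 0.8 * x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x)
ct <- cor.test(x, y)
label <- paste0("Cor=", round(ct$estimate, 3),
                "\np-value=", signif(ct$p.value, 3))

# Write the figure to a PDF; tempfile() is used here, any path would do.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 5, height = 5)
plot(x, y, pch = 16, col = "blue",
     main = paste0("Toy data (n=", length(x), ")"))
abline(fit, col = 2)                         # regression line from the fit
legend("topleft", legend = label, bty = "n") # annotation in the corner, no box
dev.off()
```

legend("topleft", ...) places the text relative to the plot region, so it stays visible whatever the expression range of the two genes is.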
References:
Next-generation characterization of the Cancer Cell Line Encyclopedia
The landscape of cancer cell line metabolism