教程对应B站:【生信技能树】转录组测序数据分析
链接:https://www.bilibili.com/video/av28453557?from=search&seid=17370292426064271948
前提了解Linux基本操作和R语言
题目
链接:http://www.bio-info-trainee.com/3920.html
准备工作
1、参考基因组及注释文件下载地址
参考基因组存储网站三个:ENSEMBL、UCSC、NCBI
基因组注释文件:gencode数据库、ENSEMBL
注释文件可以告诉我们基因组每条染色体上哪些序列是编码蛋白的基因,哪些是非编码基因,外显子、内含子、UTR位置等等。
阅读文献
2、找到文章的测序数据
2018年12月的NC文章:Spatially and functionally distinct subclasses of breast cancer-associated fibroblasts revealed by single cell RNA sequencing()
文章研究了乳腺癌小鼠模型的768个间充质细胞转录组的单细胞RNA测序(scRNA-seq),定义了三个不同的CAF(癌症相关成纤维细胞)亚群。
3、下载测序数据
GEO链接:GSE111229
- GEO Platform (GPL)
- GEO Sample (GSM)
- GEO Series (GSE)
- GEO Dataset (GDS)
关于GEO数据库:一篇文章可以有一个或者多个GSE数据集,一个GSE里面可以有一个或者多个GSM样本。多个研究的GSM样本可以根据研究目的整合为一个GDS,不过GDS本身用的很少。而每个数据集都有着自己对应的芯片/平台,就是GPL。
原始数据链接:SRP133642
关于SRA数据库:用于存储二代测序的原始数据,包括454/illumina/SOLiD/IonTorrent/Helicos/CompleteGenomics
- Studies 研究课题 SRP
- Experiments 实验设计 SRX
- Runs 测序结果集 SRR
- Samples 样本信息 SRS
# 以下练习需要6个样本,我们从 SRR_Acc_List.txt 中选择6个保存为 list.txt
cat list.txt |while read id ;do (prefetch ${id} &);done
sra->fastq->bam->counts
4、任意挑选6个样本走标准的RNA-SEQ上游流程
# source activate rna 创建小环境运行
(rna) yxu 21:16:47 ~/project/rnatest
$ tree -L 1
.
├── 1.sra_data
├── 2.raw_fq
├── 3.fastq_qc
├── 4.mapping
└── 5.counts
5 directories, 0 files
sra->fastq
(rna) yxu 21:29:34 ~/project/rnatest/1.sra_data
$ ll
total 29M
-rw-rw-r-- 1 yxu yxu 4.3M Mar 14 09:39 SRR6791080.sra
-rw-rw-r-- 1 yxu yxu 8.8M Mar 14 09:39 SRR6791081.sra
-rw-rw-r-- 1 yxu yxu 5.1M Mar 14 09:39 SRR6791082.sra
-rw-rw-r-- 1 yxu yxu 5.9M Mar 14 09:39 SRR6791083.sra
-rw-rw-r-- 1 yxu yxu 1.8M Mar 14 09:39 SRR6791084.sra
-rw-rw-r-- 1 yxu yxu 3.4M Mar 14 09:39 SRR6791085.sra
# 查看文件大小和网站上的大小是否一致
## step1: fastq-dump
ls ~/public/project/RNA/airway/sra/* |while read id;do ( nohup fastq-dump --gzip --split-3 -O ./ ${id} & );done
(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ ll
total 47M
-rw-rw-r-- 1 yxu yxu 6.8M May 14 23:50 SRR6791080.fastq.gz
-rw-rw-r-- 1 yxu yxu 15M May 14 23:51 SRR6791081.fastq.gz
-rw-rw-r-- 1 yxu yxu 8.2M May 14 23:51 SRR6791082.fastq.gz
-rw-rw-r-- 1 yxu yxu 9.5M May 14 23:51 SRR6791083.fastq.gz
-rw-rw-r-- 1 yxu yxu 2.4M May 14 23:52 SRR6791084.fastq.gz
-rw-rw-r-- 1 yxu yxu 5.0M May 14 23:52 SRR6791085.fastq.gz
## step2: check quality of sequence reads
ls *gz |xargs fastqc -t 10
multiqc ./
## step3:filter the bad quality reads and remove adaptors
(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ cat qc.sh
cat id | while read id; do (trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 -o ./ ${id});done
# id为6个样本的id号文件,自己创建
(rna) yxu 21:35:46 ~/project/rnatest/2.raw_fq
$ nohup bash qc.sh &
(rna) yxu 21:53:50 ~/project/rnatest/3.fastq_qc/clean
$ ll *gz
-rw-rw-r-- 1 yxu yxu 6.7M May 14 00:33 SRR6791080_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 15M May 14 00:33 SRR6791081_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 8.2M May 14 00:33 SRR6791082_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 9.5M May 14 00:33 SRR6791083_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 2.4M May 14 00:33 SRR6791084_trimmed.fq.gz
-rw-rw-r-- 1 yxu yxu 5.0M May 14 00:33 SRR6791085_trimmed.fq.gz
fastq->bam
(rna) yxu 21:59:51 ~/project/rnatest/3.fastq_qc/clean
$ cat mapping.sh
ls *gz|cut -d"_" -f 1 |sort -u |while read id;do hisat2 -p 10 -x /home/yxu/refernce/mm10/index/mm10/genome -U ${id}_trimmed.fq.gz |samtools sort -@ 2 -o ~/project/rnatest/4.mapping/${id}.bam;done
(rna) yxu 22:01:25 ~/project/rnatest/4.mapping
$ ls *bam
SRR6791080.bam SRR6791082.bam SRR6791084.bam
SRR6791081.bam SRR6791083.bam SRR6791085.bam
(rna) yxu 22:05:40 ~/project/rnatest/4.mapping
$ samtools view SRR6791080.bam | less -SN | head -n 10
# 查看bam文件
(rna) yxu 22:08:56 ~/project/rnatest/4.mapping
$ cat stat.sh
# 处理bam文件
ls *.bam |xargs -i samtools index {}
ls *.bam |while read id ;do ( nohup samtools flagstat -@ 1 $id > $(basename ${id} ".bam").flagstat & );done
(rna) yxu 23:00:05 ~/project/rnatest/4.mapping/stat
$ cat SRR6791080.flagstat
198223 + 0 in total (QC-passed reads + QC-failed reads)
268 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
134843 + 0 mapped (68.03% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
# 统计输入文件的相关数据并将这些数据输出至屏幕显示
bam->counts
(rna) yxu 23:02:30 ~/project/rnatest/5.counts
$ cat counts.sh
featureCounts -T 5 -p -t exon -g gene_id -a /home/yxu/refernce/mm10/Mus_musculus.GRCm38.96.chr.gtf -o ~/project/rnatest/5.counts/counts /home/yxu/project/rnatest/4.mapping/*.bam
# 得到counts矩阵
表达矩阵的多种形式
5、理解RNA-SEQ上游流程得到的表达矩阵的多种形式
转录组数据定量的方式,都是对表达量进行标准化的方法。
标准化的对象就是基因长度与测序深度。
- counts矩阵
每个基因比对到 reads数量
rpm矩阵
去除了每个细胞测序数据量(文库大小)差异rpkm矩阵
去除了基因长度效益tpm矩阵
参考资料:A short script to calculate RPKM and TPM from featureCounts output
差异分析
6、任取6个样本表达矩阵随意分成2组走差异分析代码
代码参考:https://github.com/jmzeng1314/GEO/tree/master/airway_RNAseq
需要汇总PCA,heatmap,火山图,MA图,CV图等等
rm(list = ls())
options(stringAsFactors = F)
# 测试数据表达矩阵下载
suppressMessages(library(GEOquery))
gse111229 <- getGEO("GSE111229", destdir=".", AnnotGPL = F, getGPL = F)
Gset <- gse111229[[1]]
pdata <- pData(Gset)
gse111229 <- gse111229$GSE111229_series_matrix.txt.gz
exprSet = as.data.frame(exprs(gse111229))
# 读取counts矩阵
mydata <- read.table("counts", header = TRUE,quote = '\t',skip = 1)
sampleNames <- c('SRR6791080','SRR6791081','SRR6791082','SRR6791083','SRR6791084','SRR6791085')
names(mydata)[7:12] <- sampleNames
head(mydata)
# 取Geneid和后6列样本
countMatrix <- as.matrix(mydata[7:12])
rownames(countMatrix) <- mydata$Geneid
head(countMatrix)
## SRR6791080 SRR6791081 SRR6791082 SRR6791083 SRR6791084
## ENSMUSG00000102693 0 0 0 0 0
## ENSMUSG00000064842 0 0 0 0 0
## ENSMUSG00000051951 0 0 0 0 0
## ENSMUSG00000102851 0 0 0 0 0
## ENSMUSG00000103377 0 0 0 0 0
## ENSMUSG00000104017 0 0 0 0 0
## SRR6791085
## ENSMUSG00000102693 0
## ENSMUSG00000064842 0
## ENSMUSG00000051951 0
## ENSMUSG00000102851 0
## ENSMUSG00000103377 0
## ENSMUSG00000104017 0
save(countMatrix,file = "expr.Rdata")
# 将六个样本分两组
group_list <- c('A','A','A','B','B','B')
condition <- factor(group_list)
- DEGseq
## step1: 把count矩阵转化为 DESeq2 的数据格式
suppressMessages(library(DESeq2))
dds <- DESeqDataSetFromMatrix(countMatrix, DataFrame(condition), design= ~ condition)
# 过滤掉那些 count 结果为0的数据,这些基因没有表达
dds <- dds[rowSums(counts(dds)) > 1,]
## step2: 归一化
suppressMessages(dds2 <- DESeq(dds))
rld <- rlogTransformation(dds2)
exprSet_new=assay(rld)
resultsNames(dds2)
## [1] "Intercept" "condition_B_vs_A"
## step3:提取差异分析结果
res <- results(dds2)
write.table(res,"result.csv", sep = ",", row.names = TRUE)
summary(res)
##
## out of 9251 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 4, 0.043%
## LFC < 0 (down) : 0, 0%
## outliers [1] : 478, 5.2%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# MA图
# M表示log fold change,衡量基因表达量变化,上调还是下调。A表示每个基因的count的均值
suppressMessages(library(geneplotter))
plotMA(res, main="DESeq2", ylim=c(-1,1))
# Heatmap图
sum(res$padj < 0.1, na.rm=TRUE)
suppressMessages(library("pheatmap"))
select <- order(rowMeans(counts(dds2,normalized=TRUE)),decreasing=TRUE)[1:1000]
nt <- normTransform(dds2) # defaults to log2(x+1)
log2.norm.counts <- assay(nt)[select,]
df <- as.data.frame(colData(dds2))
# pdf('heatmap1000.pdf',width = 6, height = 7)
pheatmap(log2.norm.counts, cluster_rows=TRUE, show_rownames=FALSE,
cluster_cols=TRUE, annotation_col=df)
# dev.off()
P-value:p值,是统计学检验变量,代表差异显著性,一般认为P < 0.05 为显著, P <0.01为非常显著。其含义为:由抽样误差导致样本间差异的概率小于0.05 或0.01。
Padj:p adjust,转录组测序的差异表达分析是对大量的基因表达值进行的独立统计假设检验,存在假阳性问题,因此引入Padj对显著性P值(P adjust)进行校正。Padj是对P-value的再判断,筛选更为严格。
Fold Change:适用于两分组分析,表示两样本(组)间表达量的比值。
log2FoldChange:对Fold Change取log2,一般默认表达相差2倍以上是有意义的,可以根据情况适当放宽至1.5/1.2,但最好不要低于1.2倍。
# 火山图
suppressMessages(library(ggplot2))
res$color <- ifelse(res$padj<0.05 & abs(res$log2FoldChange)>= 2,ifelse(res$log2FoldChange > 2,'red','blue'),'gray')
color <- c(red = "red",gray = "gray",blue = "blue")
res = as.data.frame(res)
#pdf('火山图.pdf',width = 6, height = 7)
ggplot(res, aes(log2FoldChange, -log10(padj), col = color)) +
geom_point() +
theme_bw() +
scale_color_manual(values = color) +
labs(x="log2 (fold change)",y="-log10 (q-value)") +
geom_hline(yintercept = -log10(0.05), lty=4,col="grey",lwd=0.6) +
geom_vline(xintercept = c(-1, 1), lty=4,col="grey",lwd=0.6) +
theme(legend.position = "none",
panel.grid=element_blank(),
axis.title = element_text(size = 16),
axis.text = element_text(size = 14))
#dev.off()
# 聚类分析
clust = t(countMatrix)
rownames(clust) <- colnames(countMatrix)
clust_dist <- dist(clust,method="euclidean")
hc <-hclust(clust_dist,"ward.D")
suppressMessages(library(factoextra))
#pdf('hc.pdf',width = 6, height = 7)
fviz_dend(hc, k = 4,
cex = 0.5,
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, rect = TRUE)
#dev.off()
- edge2
rm (list = ls())
load("expr.Rdata")
exprSet <- countMatrix
group_list <- c('A','A','A','B','B','B')
table(group_list)
## group_list
## A B
## 3 3
suppressMessages(library(edgeR))
d <- DGEList(counts=exprSet,group=factor(group_list))
keep <- rowSums(cpm(d)>1) >= 2
table(keep)
## keep
## FALSE TRUE
## 49082 6368
d <- d[keep, , keep.lib.sizes=FALSE]
d$samples$lib.size <- colSums(d$counts)
## 归一化,使用edgeR中的calcNormFactor函数,用TMMf方法
d <- calcNormFactors(d)
d$samples
# 增加一列$norm.factors
## group lib.size norm.factors
## SRR6791080 A 80451 0.9744241
## SRR6791081 A 271535 0.6072700
## SRR6791082 A 128607 1.0075648
## SRR6791083 B 148799 1.0608086
## SRR6791084 B 22442 1.0858409
## SRR6791085 B 55050 1.4561091
dge=d
design <- model.matrix(~0+factor(group_list))
rownames(design)<-colnames(dge)
colnames(design)<-levels(factor(group_list))
dge=d
dge <- estimateGLMCommonDisp(dge,design)
dge <- estimateGLMTrendedDisp(dge, design)
dge <- estimateGLMTagwiseDisp(dge, design)
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, contrast=c(1,-1))
nrDEG=topTags(lrt, n=nrow(dge))
nrDEG=as.data.frame(nrDEG)
head(nrDEG)
## logFC logCPM LR PValue FDR
## ENSMUSG00000031740 8.686641 10.805348 13.45857 0.0002438890 0.5066751
## ENSMUSG00000064080 10.544813 9.816362 12.28010 0.0004578135 0.5066751
## ENSMUSG00000019929 9.785441 10.890102 12.26568 0.0004613642 0.5066751
## ENSMUSG00000022360 -10.732952 9.994101 12.01815 0.0005268487 0.5066751
## ENSMUSG00000024909 9.312234 8.625223 11.83857 0.0005801650 0.5066751
## ENSMUSG00000031328 6.021571 10.642223 11.11854 0.0008546883 0.5066751
edgeR_DEG =nrDEG
nrDEG=edgeR_DEG[,c(1,5)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, edgeR_DEG, file = "edgeR.Rdata")
- limma
rm (list = ls())
load("expr.Rdata")
exprSet <- countMatrix
group_list <- factor(c('A','A','A','B','B','B'))
table(group_list)
## step1: 处理group_list
suppressMessages(library(limma))
design <- model.matrix(~0+factor(group_list))
colnames(design)=levels(factor(group_list))
rownames(design)=colnames(exprSet)
design
## A B
## SRR6791080 1 0
## SRR6791081 1 0
## SRR6791082 1 0
## SRR6791083 0 1
## SRR6791084 0 1
## SRR6791085 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`factor(group_list)`
## [1] "contr.treatment"
## step2: 处理表达矩阵,过滤掉矩阵表达量太低或为0的
dge <- DGEList(counts=exprSet)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=3)
logCPM[1:4,1:4]
comp='A-B'
v <- voom(dge,design,plot=TRUE, normalize="quantile")
fit <- lmFit(v, design)
cont.matrix=makeContrasts(contrasts=c(comp),levels = design)
fit2=contrasts.fit(fit,cont.matrix)
fit2=eBayes(fit2)
## step3: 处理结束,拿数据
tempOutput = topTable(fit2, coef=comp, n=Inf)
DEG_limma_voom = na.omit(tempOutput)
head(DEG_limma_voom)
## logFC AveExpr t P.Value adj.P.Val
## ENSMUSG00000053907 4.405659 4.572770 119.25502 2.268083e-08 0.0008054679
## ENSMUSG00000025403 4.911621 4.829758 112.21979 2.905204e-08 0.0008054679
## ENSMUSG00000032714 5.335097 5.027842 70.25404 1.955055e-07 0.0011633706
## ENSMUSG00000033066 1.822959 3.285393 51.92294 6.690066e-07 0.0011633706
## ENSMUSG00000020520 4.507704 4.635493 44.98478 1.198824e-06 0.0011633706
## ENSMUSG00000032387 -6.577639 5.681617 -38.75511 2.197276e-06 0.0011633706
## B
## ENSMUSG00000053907 -23.06391
## ENSMUSG00000025403 -23.44447
## ENSMUSG00000032714 -25.87765
## ENSMUSG00000033066 -27.17210
## ENSMUSG00000020520 -28.02201
## ENSMUSG00000032387 -28.98618
nrDEG=DEG_limma_voom[,c(1,4)]
colnames(nrDEG)=c('log2FoldChange','pvalue')
save(nrDEG, DEG_limma_voom, file = "limma.Rdata")
7、挑选差异分析结果的统计学显著上调下调基因集
在R里面,对统计学显著上调下调基因集,进行GO/KEGG等数据库的超几何分布检验分析,原理参考:https://mp.weixin.qq.com/s/M6CRe39xmQ_lSQqeM99kow
基因富集分析(gene set enrichment analysis)是在一组基因或蛋白中找到一类过表达的基因或蛋白。
# 跳跃学习一下如何用Bioconductor对基因组注释
# 参考链接: https://www.jianshu.com/p/ae94178918bc
# 乖巧的安装:BiocManager::install("org.Mm.eg.db")
rm(list = ls())
options(stringsAsFactors = F)
suppressMessages(library("org.Mm.eg.db"))
suppressMessages(library("clusterProfiler"))
load("limma.Rdata")
colnames(DEG_limma_voom)
## [1] "logFC" "AveExpr" "t" "P.Value" "adj.P.Val" "B"
nrDEG = DEG_limma_voom
## step1: nrDEG添加一列change, 显示上下调
logFC_cutoff <- with( nrDEG, mean( abs( logFC ) ) + 2 * sd( abs( logFC ) ) )
logFC_cutoff
## [1] 1.994444
logFC_cutoff = 1.2
nrDEG$change = as.factor(ifelse( nrDEG$P.Value < 0.05 & abs(nrDEG$logFC) > logFC_cutoff,ifelse( nrDEG$logFC > logFC_cutoff , 'UP', 'DOWN' ), 'NOT' ))
table(nrDEG$change)
## DOWN NOT UP
## 741 54063 646
## step2: 提取ENSG的ID和GENE
keytypes(org.Mm.eg.db)
nrDEG$ENSEMBL <- rownames(nrDEG)
df <- bitr(rownames(nrDEG), fromType = "ENSEMBL", toType = c( "ENTREZID" ), OrgDb = org.Mm.eg.db )
head(df)
## step3: 将GENE合并到nrDEG数据框内
nrDEG$SYMBOL = rownames(nrDEG)
nrDEG = merge(nrDEG, df, by='ENSEMBL')
head(nrDEG)
## step4: 提取上调和下调的基因集
gene_up = nrDEG[ nrDEG$change == 'UP', 'ENTREZID' ]
gene_down = nrDEG[ nrDEG$change == 'DOWN', 'ENTREZID' ]
gene_diff = c(gene_up, gene_down)
gene_all = as.character(nrDEG[ ,'ENTREZID'] )
geneList = nrDEG$logFC
names(geneList) = nrDEG$ENTREZID
geneList = sort(geneList, decreasing = T )
gene_list = as.list(geneList)
save(gene_list,file = "gene_list.txt")
## KEGG pathway analysis
kk.up <- enrichKEGG(gene = gene_up,
organism = 'mmu',
universe = gene_all,
pvalueCutoff = 0.8,
qvalueCutoff = 0.8)
kk.down <- enrichKEGG(gene = gene_down,
organism = 'mmu',
universe = gene_all,
pvalueCutoff = 0.05)
suppressMessages(library(ggplot2))
kegg_down_dt <- as.data.frame( kk.down )
kegg_up_dt <- as.data.frame( kk.up )
down_kegg <- kegg_down_dt[ kegg_down_dt$pvalue < 0.05, ]
down_kegg$group = -1
up_kegg <- kegg_up_dt[ kegg_up_dt$pvalue < 0.05, ]
up_kegg$group = 1
dat = rbind( up_kegg, down_kegg )
dat$pvalue = -log10( dat$pvalue )
dat$pvalue = dat$pvalue * dat$group
dat = dat[ order( dat$pvalue, decreasing = F ), ]
save(dat,file = 'KEGG_enrich_results.Rdata')
load(file = 'KEGG_enrich_results.Rdata')
ggplot(dat, aes(x = reorder( Description, order( pvalue, decreasing=F ) ), y = pvalue, fill = group)) +
geom_bar( stat = "identity" ) +
scale_fill_gradient( low = "blue", high = "red", guide = FALSE ) +
scale_x_discrete( name = "Pathway names" ) +
scale_y_continuous( name = "log10P-value" ) +
coord_flip() + theme_bw() + theme( plot.title = element_text( hjust = 0.5 ) ) +
ggtitle( "Pathway Enrichment" )
## GO database analysis
g_list=list(gene_up=gene_up,
gene_down=gene_down,
gene_diff=gene_diff)
go_enrich_results <- lapply(g_list , function(gene) {
lapply( c('BP','MF','CC') , function(ont) {
cat(paste('Now process ',ont ))
ego <- enrichGO(gene= gene,
universe = gene_all,
OrgDb = org.Mm.eg.db,
ont = ont ,
pAdjustMethod = "BH",
pvalueCutoff = 0.99,
qvalueCutoff = 0.99,
readable = TRUE)
print( head(ego) )
return(ego)
})
})
save(go_enrich_results,file = 'go_enrich_results.Rdata')
# OrgDb: 物种注释数据库
# ont: 是BP(Biological Process), CC(Cellular Component), MF(Molecular Function),一个基因的功能可以从生物学过程,所属细胞部分,和分子功能三个角度定义。
load(file = 'go_enrich_results.Rdata')
n1= c('gene_up','gene_down','gene_diff')
n2= c('BP','MF','CC')
for (i in 1:3){
for (j in 1:3){
fn=paste0('GO_dotplot_',n1[i],'_',n2[j],'.png')
#cat(paste0(fn,'\n'))
png(fn,res=150,width = 1080)
print(dotplot(go_enrich_results[[i]][[j]]))
dev.off()}}
dotplot(go_enrich_results[[1]][[1]],showCategory = 10,font.size = 6)
8、直接对任取6个样本表达矩阵做GSVA分析
阅读了一些关于GSVA算法和R包的介绍:https://www.jianshu.com/p/6337e0dadc17
基因集合变异分析(Gene Set Variation Analysis,GSVA)
参考代码:https://github.com/jmzeng1314/GEO
gsva函数接受的是一个纯粹的表达矩阵matrix和一个纯粹基因集合list。
rm(list = ls())
options(stringsAsFactors = F)
# BiocManager::install('GSVA')
suppressMessages(library(GSVA))
load("expr.Rdata")
exprSet <- countMatrix
dim(exprSet)
## [1] 55450 6
group_list <- c('A','A','A','B','B','B')
table(group_list)
suppressMessages(library("org.Mm.eg.db"))
suppressMessages(library("clusterProfiler"))
## step1: 整理检查数据
g2s = toTable(org.Mm.egSYMBOL)
g2e = toTable(org.Mm.egENSEMBL)
a = as.data.frame(exprSet)
# 基因集下载:http://software.broadinstitute.org/gsea/msigdb/index.jsp
d = 'MsigDB/'
gmts <- list.files(d,pattern = 'all')
# 感觉自己对这部分理解还不够,留个问题
自己没有可研究的数据,从文献中挑选任意6个样本研究,过程中可能有理解偏差,如有问题,恳请大佬指导~~
更多学习资源:
生信技能树公益视频合辑
生信技能树简书账号
生信工程师入门最佳指南
生信技能树全球公益巡讲
招学徒
...
你的宣传能让数以万计的初学者找到他们的家,技能树平台一定不会辜负每一个热爱学习和分享的同道中人 😎