RNA-seq实战(中)_合并矩阵及DEseq2筛选差异并注释

前言:
写这篇文章的目的是为了梳理一下学习思路,按部就班地仿生信菜鸟团简书:Y大宽
教程大纲,做归纳整理,即便再次运行,仍然遇到代码出错或者软件使用等诸多问题,是以为之记。

合并表达矩阵并进行注释

(1) 载入数据,添加列名

> getwd()
[1] "C:/Users/Administrator/Documents"
> setwd("c:/Users/Administrator/Documents/data_analysis/")
> options(stringsAsFactors = FALSE)
> control1<-read.table("SRR957677.count",sep = "\t",col.names = c("gene_id","control1"))
> head(control1)
          gene_id control1
1 ENSG00000000003      732
2 ENSG00000000005        0
3 ENSG00000000419      367
4 ENSG00000000457      274
5 ENSG00000000460      482
6 ENSG00000000938        0
> control2<-read.table("SRR957678.count",sep = "\t",col.names = c("gene_id","control2"))
> treat1<-read.table("SRR957679.count",sep = "\t",col.names = c("gene_id","treat1"))
> treat2<-read.table("SRR957680.count",sep = "\t",col.names = c("gene_id","treat2"))
> tail(treat2)
                     gene_id  treat2
63677        ENSG00000273493       1
63678           __no_feature 9820489
63679            __ambiguous  265464
63680        __too_low_aQual       0
63681          __not_aligned  791712
63682 __alignment_not_unique 7667078

(2) 数据整合

 # 进行整合
> raw_count <- merge(merge(control1, control2, by="gene_id"), merge(treat1, treat2, by="gene_id"))
> head(raw_count)
                 gene_id control1 control2  treat1  treat2
1 __alignment_not_unique  6660638  2661188 6945070 7667078
2            __ambiguous   219910    88320  239237  265464
3           __no_feature  8576971  3638594 7755124 9820489
4          __not_aligned   713511   325330  792918  791712
5        __too_low_aQual        0        0       0       0
6        ENSG00000000003      732      317     702     857
 #这里要注意,我读入之后顺序变了,删除的时候看下删除的是哪些行
> raw_count_filt <- raw_count[-1:-5,]
> head(raw_count_filt)
           gene_id control1 control2 treat1 treat2
6  ENSG00000000003      732      317    702    857
7  ENSG00000000005        0        0      0      1
8  ENSG00000000419      367      155    354    474
9  ENSG00000000457      274      105    211    271
10 ENSG00000000460      482      211    442    525
11 ENSG00000000938        0        0      0      0

# 第一步将匹配到的以及后面的数字连续匹配并替换为空,并赋值给ENSEMBL
> ENSEMBL <- gsub("\\.\\d*", "", raw_count_filt$gene_id)
> # 将ENSEMBL重新添加到raw_count_filt1矩阵
> row.names(raw_count_filt) <- ENSEMBL
> raw_count_filt <- cbind(ENSEMBL,raw_count_filt)
> colnames(raw_count_filt)[1] <- c("ensembl_gene_id") 
> head(raw_count_filt)
                ensembl_gene_id         gene_id control1 control2 treat1 treat2
ENSG00000000003 ENSG00000000003 ENSG00000000003      732      317    702    857
ENSG00000000005 ENSG00000000005 ENSG00000000005        0        0      0      1
ENSG00000000419 ENSG00000000419 ENSG00000000419      367      155    354    474
ENSG00000000457 ENSG00000000457 ENSG00000000457      274      105    211    271
ENSG00000000460 ENSG00000000460 ENSG00000000460      482      211    442    525
ENSG00000000938 ENSG00000000938 ENSG00000000938        0        0      0      0

(3) 对基因进行注释-获取gene_symbol
用biomaRt对ensembl_id转换成gene_symbol

> library('biomaRt')
> library("curl")
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
> my_ensembl_gene_id <- row.names(raw_count_filt)
> options(timeout = 4000000)
> hg_symbols<- getBM(attributes=c('ensembl_gene_id','hgnc_symbol',"chromosome_name", "start_position","end_position", "band"), filters= 'ensembl_gene_id', values = my_ensembl_gene_id, mart = mart)
> head(hg_symbols)
 ensembl_gene_id hgnc_symbol chromosome_name start_position end_position   band
1 ENSG00000000003      TSPAN6               X      100627109    100639991  q22.1
2 ENSG00000000005        TNMD               X      100584802    100599885  q22.1
3 ENSG00000000419        DPM1              20       50934867     50958555 q13.13
4 ENSG00000000457       SCYL3               1      169849631    169894267  q24.2
5 ENSG00000000460    C1orf112               1      169662007    169854080  q24.2
6 ENSG00000000938         FGR               1       27612064     27635277  p35.3

将合并后的表达数据框raw_count_filt和注释得到的hg_symbols整合为一:

> readcount <- merge(raw_count_filt, hg_symbols, by="ensembl_gene_id")
> head(readcount)
  ensembl_gene_id         gene_id control1 control2 treat1 treat2 hgnc_symbol chromosome_name start_position end_position   band
1 ENSG00000000003 ENSG00000000003      732      317    702    857      TSPAN6               X      100627109    100639991  q22.1
2 ENSG00000000005 ENSG00000000005        0        0      0      1        TNMD               X      100584802    100599885  q22.1
3 ENSG00000000419 ENSG00000000419      367      155    354    474        DPM1              20       50934867     50958555 q13.13
4 ENSG00000000457 ENSG00000000457      274      105    211    271       SCYL3               1      169849631    169894267  q24.2
5 ENSG00000000460 ENSG00000000460      482      211    442    525    C1orf112               1      169662007    169854080  q24.2
6 ENSG00000000938 ENSG00000000938        0        0      0      0         FGR               1       27612064     27635277  p35.3

输出count矩阵文件

> write.csv(readcount, file='readcount_all,csv')
> readcount<-raw_count_filt[ ,-1:-2]
> write.csv(readcount, file='readcount.csv')
> head(readcount)
                control1 control2 treat1 treat2
ENSG00000000003      732      317    702    857
ENSG00000000005        0        0      0      1
ENSG00000000419      367      155    354    474
ENSG00000000457      274      105    211    271
ENSG00000000460      482      211    442    525
ENSG00000000938        0        0      0      0

备注:
因为我这里用到的是人样本测序数据,而教程里面都是小鼠,所以部分略有不同。 mart <- useDataset 用的是"hsapiens_gene_ensembl"
我这里的注释后,gene_id没有小数,所以ENSEMBL <- gsub("\\.\\d*", "", raw_count_filt$gene_id) 可操作可不操作,但是为了遵循流程,我还是按照教程一步步来。但是后面发现,合并时候出了很大问题,所以在前面操作中,我将第一列提取出来的ENSEMBL当做行名同时,还将其与数据合并cbind,并命名为enseble_gene_id,后面合并也是以这一列为准。否则,后面报错。

DEseq2筛选差异表达基因并注释(bioMart)

载入数据(countData和colData)

> mycounts <- read.csv("readcount.csv")
> head(mycounts)
                X control1 control2 treat1 treat2
1 ENSG00000000003      732      317    702    857
2 ENSG00000000005        0        0      0      1
3 ENSG00000000419      367      155    354    474
4 ENSG00000000457      274      105    211    271
5 ENSG00000000460      482      211    442    525
6 ENSG00000000938        0        0      0      0
#这里有个x,需要去除,先把第一列当作行名来处理
> rownames(mycounts)<-mycounts[,1]
 #把带X的列删除
> mycounts<-mycounts[,-1]
> head(mycounts)
                control1 control2 treat1 treat2
ENSG00000000003      732      317    702    857
ENSG00000000005        0        0      0      1
ENSG00000000419      367      155    354    474
ENSG00000000457      274      105    211    271
ENSG00000000460      482      211    442    525
ENSG00000000938        0        0      0      0
 # 这一步很关键,要明白condition这里是因子,不是样本名称;小鼠数据有对照组和处理组,各两个重复
> condition <- factor(c(rep("control",2),rep("treat",2)), levels = c("control","treat"))
> condition
[1] control control treat   treat  
Levels: control treat
#colData也可以自己在excel做好另存为.csv格式,再导入即可
> colData <- data.frame(row.names=colnames(mycounts), condition)
> colData
         condition
control1   control
control2   control
treat1       treat
treat2       treat

构建dds对象,开始DESeq流程

> dds <- DESeqDataSetFromMatrix(mycounts, colData, design= ~ condition)
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
 # 查看一下dds的内容
> dds
class: DESeqDataSet 
dim: 63677 4 
metadata(1): version
assays(4): counts mu H cooks
rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492 ENSG00000273493
rowData names(22): baseMean baseVar ... deviance maxCooks
colnames(4): control1 control2 treat1 treat2
colData names(2): condition sizeFactor

总体结果查看

> res = results(dds, contrast=c("condition", "control", "treat"))
#或 res= results(dds)
> res = res[order(res$pvalue),]
> head(res)
log2 fold change (MLE): condition control vs treat 
Wald test p-value: condition control vs treat 
DataFrame with 6 rows and 6 columns
                        baseMean    log2FoldChange             lfcSE             stat               pvalue
                       <numeric>         <numeric>         <numeric>        <numeric>            <numeric>
ENSG00000178691 486.984162054358  2.92425454381134 0.240043462304368  12.182187824401  3.8676553386081e-34
ENSG00000135535 1120.04577690081  1.25399799851512 0.200354064486108  6.2589097043353 3.87678244698528e-10
ENSG00000196504 1696.62434746568  1.14790146165414 0.205928682549443 5.57426701051483 2.48574233542747e-08
ENSG00000141425 1185.45742498851 0.966418817241135  0.18074690909121    5.34680688096 8.95194410133491e-08
ENSG00000173905 500.115823536337  1.17911060574859 0.225062922468865 5.23902645897482 1.61425895781048e-07
ENSG00000164172 242.508963852987  1.25147639840061 0.243039125577329 5.14927954677166 2.61488912148493e-07
                                padj
                           <numeric>
ENSG00000178691 1.23146145981282e-30
ENSG00000135535 6.17183765560057e-07
ENSG00000196504 2.63820119866702e-05
ENSG00000141425 7.12574750466259e-05
ENSG00000173905 0.000102796010433372
ENSG00000164172                   NA
> summary(res)
out of 28482 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 51, 0.18%
LFC < 0 (down)     : 7, 0.025%
outliers [1]       : 0, 0%
low counts [2]     : 25298, 89%
(mean count < 465)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
#所有结果先进行输出
> write.csv(res,file="All_results.csv")
> table(res$padj<0.05)

FALSE  TRUE 
 3159    25 
注释:dds=DESeqDataSet Object

summary(res),一共51个genes上调,7个genes下调,没有离群值。padj小于0.05的共有25个。

提取差异表达genes(DEGs)

获取padj(p值经过多重校验校正后的值)小于0.05,表达倍数取以2为对数后大于1或者小于-1的差异表达基因。代码如下:

> diff_gene_deseq2 <-subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
或diff_gene_deseq2 <-subset(res,padj < 0.05 & (log2FoldChange > 1 | log2FoldChange < -1))
> dim(diff_gene_deseq2)
[1] 9 6
> head(diff_gene_deseq2)
log2 fold change (MLE): condition control vs treat 
Wald test p-value: condition control vs treat 
DataFrame with 6 rows and 6 columns
                        baseMean   log2FoldChange             lfcSE             stat               pvalue
                       <numeric>        <numeric>         <numeric>        <numeric>            <numeric>
ENSG00000178691 486.984162054358 2.92425454381134 0.240043462304368  12.182187824401  3.8676553386081e-34
ENSG00000135535 1120.04577690081 1.25399799851512 0.200354064486108  6.2589097043353 3.87678244698528e-10
ENSG00000196504 1696.62434746568 1.14790146165414 0.205928682549443 5.57426701051483 2.48574233542747e-08
ENSG00000173905 500.115823536337 1.17911060574859 0.225062922468865 5.23902645897482 1.61425895781048e-07
ENSG00000187772 665.194086507122 1.30236343380035 0.261527275189324 4.97983788825677 6.36375565744898e-07
ENSG00000152818 530.472516246284 1.23920837988832 0.260201177985899 4.76250103662281 1.91208215146831e-06
                                padj
                           <numeric>
ENSG00000178691 1.23146145981282e-30
ENSG00000135535 6.17183765560057e-07
ENSG00000196504 2.63820119866702e-05
ENSG00000173905 0.000102796010433372
ENSG00000187772 0.000337703300221959
ENSG00000152818 0.000869724224325012
> write.csv(diff_gene_deseq2,file= "DEG_treat_vs_control.csv")

用bioMart对差异表达基因进行注释

> library('biomaRt')
> library("curl")
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
> my_ensembl_gene_id<-row.names(diff_gene_deseq2)
> #listAttributes(mart)
> hg_symbols<- getBM(attributes=c('ensembl_gene_id','external_gene_name',"description"),
+                     filters = 'ensembl_gene_id', values = my_ensembl_gene_id, mart = mart)
> head(hg_symbols)
  ensembl_gene_id external_gene_name
1 ENSG00000011405            PIK3C2A
2 ENSG00000100731              PCNX1
3 ENSG00000135535              CD164
4 ENSG00000152818               UTRN
5 ENSG00000173905             GOLIM4
6 ENSG00000178691              SUZ12
                                                                                                  description
1 phosphatidylinositol-4-phosphate 3-kinase catalytic subunit type 2 alpha [Source:HGNC Symbol;Acc:HGNC:8971]
2                                                       pecanex homolog 1 [Source:HGNC Symbol;Acc:HGNC:19740]
3                                                           CD164 molecule [Source:HGNC Symbol;Acc:HGNC:1632]
4                                                                utrophin [Source:HGNC Symbol;Acc:HGNC:12635]
5                                       golgi integral membrane protein 4 [Source:HGNC Symbol;Acc:HGNC:15448]
6                            SUZ12, polycomb repressive complex 2 subunit [Source:HGNC Symbol;Acc:HGNC:17101]
#合并数据:res结果hg_symbols合并成一个文件
> head(diff_gene_deseq2)
log2 fold change (MLE): condition control vs treat 
Wald test p-value: condition control vs treat 
DataFrame with 6 rows and 6 columns
                        baseMean   log2FoldChange             lfcSE             stat               pvalue
                       <numeric>        <numeric>         <numeric>        <numeric>            <numeric>
ENSG00000178691 486.984162054358 2.92425454381134 0.240043462304368  12.182187824401  3.8676553386081e-34
ENSG00000135535 1120.04577690081 1.25399799851512 0.200354064486108  6.2589097043353 3.87678244698528e-10
ENSG00000196504 1696.62434746568 1.14790146165414 0.205928682549443 5.57426701051483 2.48574233542747e-08
ENSG00000173905 500.115823536337 1.17911060574859 0.225062922468865 5.23902645897482 1.61425895781048e-07
ENSG00000187772 665.194086507122 1.30236343380035 0.261527275189324 4.97983788825677 6.36375565744898e-07
ENSG00000152818 530.472516246284 1.23920837988832 0.260201177985899 4.76250103662281 1.91208215146831e-06
                                padj
                           <numeric>
ENSG00000178691 1.23146145981282e-30
ENSG00000135535 6.17183765560057e-07
ENSG00000196504 2.63820119866702e-05
ENSG00000173905 0.000102796010433372
ENSG00000187772 0.000337703300221959
ENSG00000152818 0.000869724224325012
> ensembl_gene_id<-rownames(diff_gene_deseq2)
> diff_gene_deseq2<-cbind(ensembl_gene_id,diff_gene_deseq2)
> colnames(diff_gene_deseq2)[1]<-c("ensembl_gene_id")
> diff_name<-merge(diff_gene_deseq2,hg_symbols,by="ensembl_gene_id")
> head(diff_name)
DataFrame with 6 rows and 9 columns
  ensembl_gene_id         baseMean   log2FoldChange             lfcSE             stat               pvalue
      <character>        <numeric>        <numeric>         <numeric>        <numeric>            <numeric>
1 ENSG00000011405 752.506353023592 1.11870520067213 0.270066040576769 4.14233940069974 3.43781090040641e-05
2 ENSG00000100731 559.720563162194 1.02663422935132 0.217107461114783 4.72869160774048 2.25971310040649e-06
3 ENSG00000135535 1120.04577690081 1.25399799851512 0.200354064486108  6.2589097043353 3.87678244698528e-10
4 ENSG00000152818 530.472516246284 1.23920837988832 0.260201177985899 4.76250103662281 1.91208215146831e-06
5 ENSG00000173905 500.115823536337 1.17911060574859 0.225062922468865 5.23902645897482 1.61425895781048e-07
6 ENSG00000178691 486.984162054358 2.92425454381134 0.240043462304368  12.182187824401  3.8676553386081e-34
                  padj external_gene_name
             <numeric>        <character>
1  0.00821214462583091            PIK3C2A
2 0.000899365813961783              PCNX1
3 6.17183765560057e-07              CD164
4 0.000869724224325012               UTRN
5 0.000102796010433372             GOLIM4
6 1.23146145981282e-30              SUZ12
                                                                                                  description
                                                                                                  <character>
1 phosphatidylinositol-4-phosphate 3-kinase catalytic subunit type 2 alpha [Source:HGNC Symbol;Acc:HGNC:8971]
2                                                       pecanex homolog 1 [Source:HGNC Symbol;Acc:HGNC:19740]
3                                                           CD164 molecule [Source:HGNC Symbol;Acc:HGNC:1632]
4                                                                utrophin [Source:HGNC Symbol;Acc:HGNC:12635]
5                                       golgi integral membrane protein 4 [Source:HGNC Symbol;Acc:HGNC:15448]
6                            SUZ12, polycomb repressive complex 2 subunit [Source:HGNC Symbol;Acc:HGNC:17101]
#查看CD164的情况
> CD164 <- diff_name[diff_name$external_gene_name=="CD164",]
> CD164
DataFrame with 1 row and 9 columns
  ensembl_gene_id         baseMean   log2FoldChange             lfcSE            stat               pvalue                 padj
      <character>        <numeric>        <numeric>         <numeric>       <numeric>            <numeric>            <numeric>
1 ENSG00000135535 1120.04577690081 1.25399799851512 0.200354064486108 6.2589097043353 3.87678244698528e-10 6.17183765560057e-07
  external_gene_name                                       description
         <character>                                       <character>
1              CD164 CD164 molecule [Source:HGNC Symbol;Acc:HGNC:1632]

至此,差异表达基因提取并注释完毕。

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 199,636评论 5 468
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 83,890评论 2 376
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 146,680评论 0 330
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 53,766评论 1 271
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 62,665评论 5 359
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,045评论 1 276
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,515评论 3 390
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,182评论 0 254
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,334评论 1 294
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,274评论 2 317
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,319评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,002评论 3 315
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,599评论 3 303
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,675评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 30,917评论 1 255
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 42,309评论 2 345
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 41,885评论 2 341

推荐阅读更多精彩内容