应审稿人要求| pseudo bulk差异分析

一、写在前面

最近有粉丝提问,收到了如下的审稿人意见:


1.png

审稿人认为在单细胞测序过程中,利用findMarker通过Wilcox获得的差异基因虽然考虑到了不同组别细胞数量的不同,但是未能考虑到每组样本数量的不同。因此作者希望纳入样本水平的pseudo-bulk分析能够有助于确认两种条件下的差异基因。首先我个人觉得审稿人自己的话有些矛盾,使用pseudo-bulk计算差异基因,岂不是无法考虑到不同组别中样本数量的差异?另外这有些吹毛求疵,在Seurat V5之前,作者甚至没有在包里集成pseudo-bulk的函数与算法(当然也可以自己提取矩阵计算)。难道能说作者发表的这几篇Cell和Nature给大家推荐的流程不好吗:
Hao, Hao, et al., Cell 2021 [Seurat v4]
Stuart, Butler, et al., Cell 2019 [Seurat v3]
Butler, et al., Nat Biotechnol 2018 [Seurat v2] Satija, Farrell, et al., Nat Biotechnol 2015 [Seurat v1]再者说,scRNA-seqBulk RNA-Seq更加的稀疏,将前者模拟为后者参与差异计算,其实也没那么科学。当然,审稿人的观点也不是全无道理,若能够通过不同的算法得到相同的差异基因结果,的确有较高的说服力。

二、pseudo-bulk差异分析走起

2.1 数据载入

# 加载R包
library(Seurat)``## 载入需要的程序包:SeuratObject``
## 载入需要的程序包:sp``##
## 载入程序包:'SeuratObject'``
## The following objects are masked from 'package:base':
##
##     intersect, t

# 读取数据:
scRNA <- readRDS('test_data/T1D_scRNA.rds')

# 这个数据包含24个样本:
unique(scRNA$sample)``##  [1] "D_503"  "H_120"  "H_630"  "H_3060" "D_609"  "H_727"  "H_4579" "D_504"
##  [9] "H_3128" "H_7108" "D_502"  "D_497"  "D_506"  "H_409"  "H_6625" "D_610"
## [17] "D_501"  "D_500"  "H_4119" "H_1334" "D_498"  "H_2928" "D_644"  "D_505"``# 包含两个组别的数据:
DimPlot(scRNA,split.by = 'Group')`
![2.png](https://upload-images.jianshu.io/upload_images/28196887-db0a2d0cc9c2d667.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)


## 2.2 差异计算

### (1) pseudo-bulk差异计算

`### 生成拟bulk 数据 **###**
bulk <- AggregateExpression(scRNA, return.seurat = T, slot = "counts", assays = "RNA",
                           group.by = c("cell_type", "sample", "Group")# 分别填写细胞类型、样本变量、分组变量的slot名称
   )``## Names of identity class contain underscores ('_'), replacing with dashes ('-')
## Centering and scaling data matrix
##
## This message is displayed once every 8 hours.``# 生成的是一个新的Seurat对象
bulk``## An object of class Seurat
## 41056 features across 345 samples within 1 assay
## Active assay: RNA (41056 features, 0 variable features)
##  3 layers present: counts, data, scale.data`

我们可以像普通`scRNA-seq`的`Seurat`对象一样,利用`FindMarkers()`进行差异分析,我们这里用`celltype10`做演示。

`# 取出celltype10对应的对象:
ct10.bulk <- subset(bulk, cell_type == "celltype10")
# 改变默认分类变量:
Idents(ct10.bulk) <- "Group"

# 下面的额计算依赖DESeq2,做过Bulk RNA-Seq的同学都知道:
**if**(!require(DESeq2))BiocManager::install('DESeq2')``## 载入需要的程序包:DESeq2``## 载入需要的程序包:S4Vectors``## 载入需要的程序包:stats4``## 载入需要的程序包:BiocGenerics``##
## 载入程序包:'BiocGenerics'``## The following object is masked from 'package:SeuratObject':
##
##     intersect``## The following objects are masked from 'package:stats':
##
##     IQR, mad, sd, var, xtabs``## The following objects are masked from 'package:base':
##
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
##     tapply, union, unique, unsplit, which.max, which.min``##
## 载入程序包:'S4Vectors'``## The following object is masked from 'package:utils':
##
##     findMatches``## The following objects are masked from 'package:base':
##
##     expand.grid, I, unname``## 载入需要的程序包:IRanges``##
## 载入程序包:'IRanges'``## The following object is masked from 'package:sp':
##
##     %over%``## The following object is masked from 'package:grDevices':
##
##     windows``## 载入需要的程序包:GenomicRanges``## 载入需要的程序包:GenomeInfoDb``## 载入需要的程序包:SummarizedExperiment``## 载入需要的程序包:MatrixGenerics``## 载入需要的程序包:matrixStats``##
## 载入程序包:'MatrixGenerics'``## The following objects are masked from 'package:matrixStats':
##
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars``## 载入需要的程序包:Biobase``## Welcome to Bioconductor
##
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.``##
## 载入程序包:'Biobase'``## The following object is masked from 'package:MatrixGenerics':
##
##     rowMedians``## The following objects are masked from 'package:matrixStats':
##
##     anyMissing, rowMedians``##
## 载入程序包:'SummarizedExperiment'``## The following object is masked from 'package:Seurat':
##
##     Assays``## The following object is masked from 'package:SeuratObject':
##
##     Assays``# 差异计算:
bulk_deg <- FindMarkers(ct10.bulk, ident.1 = "D", ident.2 = "H", # 这样算出来的Fold Change就是D/H
                         slot = "counts", test.use = "DESeq2",# 这里可以选择其它算法
     verbose = F# 关闭进度提示
     )``## converting counts to integer mode``## gene-wise dispersion estimates``## mean-dispersion relationship``## final dispersion estimates``head(bulk_deg)# 看一下差异列表``##                        p_val avg_log2FC pct.1 pct.2 p_val_adj
## ENSG00000047346 1.900955e-05 -1.1201537 0.917 1.000  0.780456
## ENSG00000168685 7.606182e-05 -0.5817112 1.000 1.000  1.000000
## ENSG00000131759 9.696591e-05 -2.1668759 0.667 1.000  1.000000
## ENSG00000166750 2.136829e-04 -1.5473545 0.750 0.917  1.000000
## ENSG00000163947 3.127113e-04 -0.9136378 0.917 1.000  1.000000
## ENSG00000239713 4.090397e-04 -2.2337008 0.250 0.917  1.000000`

如何写循环计算所有细胞类型的差异基因,就留在这里当习题啦。

(2)细胞水平的差异计算

# 整理分组变量:
scRNA$CT_Group <- paste(scRNA$cell_type,scRNA$Group,sep = '_')

# 查看新的分组变量:
unique(scRNA$CT_Group)``##  [1] "celltype12_D" "celltype3_H"  "celltype13_H" "celltype6_H"  "celltype0_D"
##  [6] "celltype12_H" "celltype4_H"  "celltype11_D" "celltype14_H" "celltype9_H"
## [11] "celltype11_H" "celltype2_H"  "celltype0_H"  "celltype7_H"  "celltype14_D"
## [16] "celltype1_D"  "celltype4_D"  "celltype1_H"  "celltype8_H"  "celltype3_D"
## [21] "celltype13_D" "celltype8_D"  "celltype7_D"  "celltype5_H"  "celltype6_D"
## [26] "celltype15_H" "celltype2_D"  "celltype5_D"  "celltype10_H" "celltype9_D"
## [31] "celltype10_D" "celltype15_D"``# 差异计算:
cell_deg <- FindMarkers(scRNA,ident.1 = 'celltype10_D',ident.2 = 'celltype10_H' ,group.by = 'CT_Group')# 同样得到的是celltype10在D组 vs H组的结果``## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session

(3)两种算法的对比

library(dplyr)``##
## 载入程序包:'dplyr'``## The following object is masked from 'package:Biobase':
##
##     combine``## The following object is masked from 'package:matrixStats':
##
##     count``## The following objects are masked from 'package:GenomicRanges':
##
##     intersect, setdiff, union``## The following object is masked from 'package:GenomeInfoDb':
##
##     intersect``## The following objects are masked from 'package:IRanges':
##
##     collapse, desc, intersect, setdiff, slice, union``## The following objects are masked from 'package:S4Vectors':
##
##     first, intersect, rename, setdiff, setequal, union``## The following objects are masked from 'package:BiocGenerics':
##
##     combine, intersect, setdiff, union``## The following objects are masked from 'package:stats':
##
##     filter, lag``## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union``# 先看一下两种算法的显著差异基因数量  
bulk_sig <- filter(bulk_deg,p_val < 0.05)
nrow(bulk_sig)``## [1] 308``cell_sig <- filter(cell_deg,p_val < 0.05)
nrow(cell_sig)``## [1] 1494

可以看出,pseudo-bulk得到的差异基因数量要少很多,画一个韦恩图看看二者交集

**if**(!require(VennDiagram))install.packages("VennDiagram")``## 载入需要的程序包:VennDiagram``## 载入需要的程序包:grid``## 载入需要的程序包:futile.logger``venn.plot <- venn.diagram(
 x = list(Bulk = rownames(bulk_sig), Cell = rownames(cell_sig)
),
 category.names = c("Bulk DEG", "Single-Cell DEG"),
 filename = NULL,
 output = TRUE,
 main = "Venn Diagram of Significant Genes"
)
grid.draw(venn.plot)
3.png

可以看出包含关系还是挺明显的,那我们再用交集基因的avg_log2FC做一个线性回归看看两次差异分析的相关性如何:

# 获得两次差异分析共同出现的基因:
inter_gene <- intersect(rownames(bulk_sig),rownames(cell_sig))
# 取出avg_log2FC整理为数据框
data4plot <- data.frame(Bulk = bulk_sig[inter_gene,'avg_log2FC'],
  Cell = cell_sig[inter_gene,'avg_log2FC']  )

# 线性回归分析:
lm.model <- lm(Bulk ~ Cell,data = data4plot)
summary(lm.model)#看一下统计学参数``##
## Call:
## lm(formula = Bulk ~ Cell, data = data4plot)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.01613 -0.24964 -0.04723  0.17148  2.19351
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.17249    0.02757  -6.257 1.58e-09 ***
## Cell         0.70081    0.01959  35.767  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4107 on 263 degrees of freedom
## Multiple R-squared:  0.8295, Adjusted R-squared:  0.8288
## F-statistic:  1279 on 1 and 263 DF,  p-value: < 2.2e-16``mypara <- coefficients(lm.model)#得到截距和斜率
a <- mypara[2]#斜率
b <- mypara[1]#截距
a <- round(a,2)#取两位有效数字
b <- round(b,2)

library(ggplot2)
library(ggpubr)``##
## 载入程序包:'ggpubr'``## The following object is masked from 'package:VennDiagram':
##
##     rotate``# 来个散点图吧~
lmplot <- ggplot( data4plot,aes(x=Bulk, y=Cell))+
 geom_point(color="black")+
 stat_smooth(method="lm",se=TRUE)+stat_cor(data=data4plot, method = "pearson")+#加上置信区间、R值、P值
 ggtitle(label = paste(": y = ", a, " * x + ", b, sep = ""))+geom_rug()+#加上线性回归方程
 labs(x='Bulk DEG',
      y= 'single-cell DEG')

lmplot``## `geom_smooth()` using formula = 'y ~ x'
4.png

R=0.91,那么R^2就是0.83,可以看出二者的相关性还是不错的,就看能不能过审稿人这关啦。

大家有什么新的想法,欢迎在评论区留言~

环境信息

sessionInfo()``## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8  
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C                              
## [5] LC_TIME=Chinese (Simplified)_China.utf8    
##
## time zone: Asia/Shanghai
## tzcode source: internal
##
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base    
##
## other attached packages:
##  [1] ggpubr_0.6.0                ggplot2_3.5.1              
##  [3] VennDiagram_1.7.3           futile.logger_1.4.3        
##  [5] dplyr_1.1.4                 DESeq2_1.44.0              
##  [7] SummarizedExperiment_1.34.0 Biobase_2.64.0            
##  [9] MatrixGenerics_1.16.0       matrixStats_1.4.1          
## [11] GenomicRanges_1.56.1        GenomeInfoDb_1.40.1        
## [13] IRanges_2.38.1              S4Vectors_0.42.1          
## [15] BiocGenerics_0.50.0         Seurat_5.1.0              
## [17] SeuratObject_5.0.2          sp_2.1-4                  
##
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22        splines_4.4.1           later_1.3.2            
##   [4] tibble_3.2.1            polyclip_1.10-7         fastDummies_1.7.4      
##   [7] lifecycle_1.0.4         rstatix_0.7.2           globals_0.16.3        
##  [10] lattice_0.22-6          MASS_7.3-60.2           backports_1.5.0        
##  [13] magrittr_2.0.3          plotly_4.10.4           sass_0.4.9            
##  [16] rmarkdown_2.28          jquerylib_0.1.4         yaml_2.3.10            
##  [19] httpuv_1.6.15           sctransform_0.4.1       spam_2.10-0            
##  [22] spatstat.sparse_3.1-0   reticulate_1.39.0       cowplot_1.1.3          
##  [25] pbapply_1.7-2           RColorBrewer_1.1-3      abind_1.4-5            
##  [28] zlibbioc_1.50.0         Rtsne_0.17              purrr_1.0.2            
##  [31] GenomeInfoDbData_1.2.12 ggrepel_0.9.6           irlba_2.3.5.1          
##  [34] listenv_0.9.1           spatstat.utils_3.1-0    openintro_2.5.0        
##  [37] airports_0.1.0          goftest_1.2-3           RSpectra_0.16-2        
##  [40] spatstat.random_3.3-1   fitdistrplus_1.2-1      parallelly_1.38.0      
##  [43] leiden_0.4.3.1          codetools_0.2-20        DelayedArray_0.30.1    
##  [46] tidyselect_1.2.1        UCSC.utils_1.0.0        farver_2.1.2          
##  [49] spatstat.explore_3.3-2  jsonlite_1.8.8          progressr_0.14.0      
##  [52] ggridges_0.5.6          survival_3.6-4          tools_4.4.1            
##  [55] ica_1.0-3               Rcpp_1.0.13             glue_1.7.0            
##  [58] gridExtra_2.3           SparseArray_1.4.8       mgcv_1.9-1            
##  [61] xfun_0.47               withr_3.0.1             formatR_1.14          
##  [64] fastmap_1.2.0           fansi_1.0.6             digest_0.6.37          
##  [67] R6_2.5.1                mime_0.12               colorspace_2.1-1      
##  [70] scattermore_1.2         tensor_1.5              spatstat.data_3.1-2    
##  [73] utf8_1.2.4              tidyr_1.3.1             generics_0.1.3        
##  [76] data.table_1.16.0       usdata_0.3.1            httr_1.4.7            
##  [79] htmlwidgets_1.6.4       S4Arrays_1.4.1          uwot_0.2.2            
##  [82] pkgconfig_2.0.3         gtable_0.3.5            lmtest_0.9-40          
##  [85] XVector_0.44.0          htmltools_0.5.8.1       carData_3.0-5          
##  [88] dotCall64_1.1-1         scales_1.3.0            png_0.1-8              
##  [91] spatstat.univar_3.0-1   knitr_1.48              lambda.r_1.2.4        
##  [94] rstudioapi_0.16.0       tzdb_0.4.0              reshape2_1.4.4        
##  [97] nlme_3.1-164            cachem_1.1.0            zoo_1.8-12            
## [100] stringr_1.5.1           KernSmooth_2.23-24      parallel_4.4.1        
## [103] miniUI_0.1.1.1          pillar_1.9.0            vctrs_0.6.5            
## [106] RANN_2.6.2              promises_1.3.0          car_3.1-2              
## [109] xtable_1.8-4            cluster_2.1.6           evaluate_0.24.0        
## [112] readr_2.1.5             cli_3.6.3               locfit_1.5-9.10        
## [115] compiler_4.4.1          futile.options_1.0.1    rlang_1.1.4            
## [118] crayon_1.5.3            future.apply_1.11.2     ggsignif_0.6.4        
## [121] labeling_0.4.3          plyr_1.8.9              stringi_1.8.4          
## [124] viridisLite_0.4.2       deldir_2.0-4            BiocParallel_1.38.0    
## [127] munsell_0.5.1           lazyeval_0.2.2          spatstat.geom_3.3-2    
## [130] Matrix_1.7-0            RcppHNSW_0.6.0          hms_1.1.3              
## [133] patchwork_1.2.0         future_1.34.0           shiny_1.9.1            
## [136] highr_0.11              ROCR_1.0-11             broom_1.0.6            
## [139] igraph_2.0.3            bslib_0.8.0             cherryblossom_0.1.0
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 206,214评论 6 481
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 88,307评论 2 382
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 152,543评论 0 341
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 55,221评论 1 279
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 64,224评论 5 371
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 49,007评论 1 284
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,313评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,956评论 0 259
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 43,441评论 1 300
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,925评论 2 323
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,018评论 1 333
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,685评论 4 322
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,234评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,240评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,464评论 1 261
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,467评论 2 352
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,762评论 2 345

推荐阅读更多精彩内容