『读书笔记』- DESeq2 官方文件阅读笔记 01

正文前先容我扯两句蛋

昨天群里交流WGCNA的时候和大家聊到软件教程,现在很多人来了直接就网上各种教程搜一大堆,然后follow别人的code,照葫芦画瓢做出一堆结果,符合预期就是皆大欢喜,做不出来就愁眉不展...

回想起大三时准备考研,天天埋头图书馆啃王镜岩的生化,Gene8,分子生物学,当时总结出来所有的东西必须自己过一遍才能成为自己的东西,不要迷信别人总结的东西,别人总结的终究是别人的,背的滚瓜烂熟也变不成自己的,所以学习没有捷径,还是需要静下心来慢慢自己啃,这样遇到问题追本溯源,才能更好的解决问题。

follow别人的流程大多会遇到一个问题,就是如果出错了会一头雾水,找不到出错的原因,教程固然不错,但是仅仅限于参考,如果真正的想要“学会“一个软件,还是要回归作者官方的manual,FAQ,article等等,作为数学渣渣的我,算法(article)就不去碰了,太难了,但至少做到会使用软件,关键function的原理和功能,FAQ,manual是必须看看的。虽然follow别人,edgeR,DESeq2已经可以跑通了,但回去啃下官方的文档还是非常必要的。

关于算法原理部分,statquest连出了多个教程,慢慢看

Part1. 数据准备和参数设置

Why un-normalized counts?

As input, the DESeq2 package expects count data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of integer values. The value in the i-th row and the j-th column of the matrix tells how many reads can be assigned to gene i in sample j. Analogously, for other types of assays, the rows of the matrix might correspond e.g. to binding regions (with ChIP-Seq) or peptide sequences (with quantitative mass spectrometry). We will list method for obtaining count matrices in sections below.

  • 首先,inputdata应该是一个表达矩阵,类似下图:
  • 行名为转录本的ID,列名为样本的名字,
  • 矩阵中为表达量。
Transcript_ID Sample1 Sample2 Sample3 Sample4
Transcript1 0 1 10 5
Transcript2 100 104 107 100
Transcript3 30 35 37 39
Transcript4 42 44 45 200

The values in the matrix should be un-normalized counts or estimated counts of sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq). The RNA-seq workflow describes multiple techniques for preparing such count matrices. It is important to provide count matrices as input for DESeq2’s statistical model (Love, Huber, and Anders 2014) to hold, as only the count values allow assessing the measurement precision correctly. The DESeq2 model internally corrects for library size, so transformed or normalized values such as counts scaled by library size should not be used as input.

  • 这里指出了标准化后的counts是不能用的,比如FPKM,RPKM,TPM等.
  • 原因是DESeq2会在内部校正文库大小。

The DESeqDataSet

The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet, which will usually be represented in the code here as an object dds.

A technical detail is that the DESeqDataSet class extends the RangedSummarizedExperiment class of the SummarizedExperiment package. The “Ranged” part refers to the fact that the rows of the assay data (here, the counts) can be associated with genomic ranges (the exons of genes). This association facilitates downstream exploration of results, making use of other Bioconductor packages’ range-based functionality (e.g. find the closest ChIP-seq peaks to the differentially expressed genes).

A DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modeling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into an formula if it is not already). The design can be changed later, however then all differential analysis steps should be repeated, as the design formula is used to estimate the dispersions and to estimate the log2 fold changes of the model.

Note: In order to benefit from the default settings of the package, you should put the variable of interest at the end of the formula and make sure the control level is the first level.

We will now show 4 ways of constructing a DESeqDataSet, depending on what pipeline was used upstream of DESeq2 to generated counts or estimated counts:

  • From transcript abundance files and tximport
  • From a count matrix
  • From htseq-count files
  • From a SummarizedExperiment object
  • 一个叫dds的object来储存DESeqDataSet, 这个dds必须关联一个设计公式(design formula)该公式会由一个~符号开始,后面连接相应的变量,不同变量间用+连接.
  • 根据上游不同的流程分别对应了不同的建立dds的方法.

这里主要看基于count matrix的,因为我比较喜欢用featurecount来做定量,此外测序公司返回的数据中一般也有readcount的matrix

Count matrix input

Alternatively, the function DESeqDataSetFromMatrix can be used if you already have a matrix of read counts prepared from another source. Another method for quickly producing count matrices from alignment files is the featureCounts function (Liao, Smyth, and Shi 2013) in the Rsubread package. To use DESeqDataSetFromMatrix, the user should provide the counts matrix, the information about the samples (the columns of the count matrix) as a DataFrame or data.frame, and the design formula.

To demonstate the use of DESeqDataSetFromMatrix, we will read in count data from the pasilla package. We read in a count matrix, which we will name cts, and the sample information table, which we will name coldata. Further below we describe how to extract these objects from, e.g. featureCounts output.

官方例子

library("pasilla")
pasCts <- system.file("extdata",
                      "pasilla_gene_counts.tsv",
                      package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
                       "pasilla_sample_annotation.csv",
                       package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
## 得到的结果

head(cts,2)
##             untreated1 untreated2 untreated3 untreated4 treated1 treated2
## FBgn0000003          0          0          0          0        0        0
## FBgn0000008         92        161         76         70      140       88
##             treated3
## FBgn0000003        1
## FBgn0000008       70
coldata
##              condition        type
## treated1fb     treated single-read
## treated2fb     treated  paired-end
## treated3fb     treated  paired-end
## untreated1fb untreated single-read
## untreated2fb untreated single-read
## untreated3fb untreated  paired-end
## untreated4fb untreated  paired-end

## 发现顺序好像不一致

rownames(coldata) <- sub("fb", "", rownames(coldata))
all(rownames(coldata) %in% colnames(cts))
## [1] TRUE
all(rownames(coldata) == colnames(cts))
## [1] FALSE
cts <- cts[, rownames(coldata)]
all(rownames(coldata) == colnames(cts))
## [1] TRUE

## 拿到cts 和colnames后可以着手构建dds
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds
## class: DESeqDataSet 
## dim: 14599 7 
## metadata(1): version
## assays(1): counts
## rownames(14599): FBgn0000003 FBgn0000008 ... FBgn0261574
##   FBgn0261575
## rowData names(0):
## colnames(7): treated1 treated2 ... untreated3 untreated4
## colData names(2): condition type
############################################################################
# If you have additional feature data, it can be added to the DESeqDataSet by adding to the metadata columns of a newly constructed object. (Here we add redundant data just for demonstration, as the gene names are already the rownames of the dds.)
##########################################################################

featureData <- data.frame(gene=rownames(cts))
mcols(dds) <- DataFrame(mcols(dds), featureData)
mcols(dds)

## DataFrame with 14599 rows and 1 column
##                    gene
##                <factor>
## FBgn0000003 FBgn0000003
## FBgn0000008 FBgn0000008
## FBgn0000014 FBgn0000014
## FBgn0000015 FBgn0000015
## FBgn0000017 FBgn0000017
## ...                 ...
## FBgn0261571 FBgn0261571
## FBgn0261572 FBgn0261572
## FBgn0261573 FBgn0261573
## FBgn0261574 FBgn0261574
## FBgn0261575 FBgn0261575

Collapsing technical replicates

DESeq2 provides a function collapseReplicates which can assist in combining the counts from technical replicates into single columns of the count matrix. The term technical replicate implies multiple sequencing runs of the same library. You should NOT collapse biological replicates using this function. See the manual page for an example of the use of collapseReplicates.

  • 合并技术重复的counts,这里特别强调了不可以合并生物学重复的!!!

Differential expression analysis

The standard differential expression analysis steps are wrapped into a single function, DESeq. The estimation steps performed by this function are described below, in the manual page for ?DESeq and in the Methods section of the DESeq2 publication (Love, Huber, and Anders 2014).

Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. With no additional arguments to results, the log2 fold change and Wald test p value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the reference level (see previous note on factor levels). However, the order of the variables of the design do not matter so long as the user specifies the comparison to build a results table for, using the name or contrast arguments of results.

Details about the comparison are printed to the console, directly above the results table. The text, condition treated vs untreated, tells you that the estimates are of the logarithmic fold change log2(treated/untreated).

  • 结果表格中通常包含了以下几个内容:

    1. log2foldchange;
    2. p value
    3. adjusted p value.(校正过的p值,或者叫 q value)
    4. 如果实验设计没有特殊的要求,log2FC和q value将会成为最终判定是否差异的标准。
  • 再实验设计的时候需要对sample进行标记,分清楚case和control,这部分在condition中分清楚。

举个🌰


dds <- DESeq(dds)
res <- results(dds)
res
## log2 fold change (MLE): condition treated vs untreated 
## Wald test p-value: condition treated vs untreated 
## DataFrame with 9921 rows and 6 columns
##                     baseMean      log2FoldChange             lfcSE
##                    <numeric>           <numeric>         <numeric>
## FBgn0000008 95.1442917575889 0.00227644122547027  0.22372865161848
## FBgn0000014 1.05652281859341  -0.495120386253493  2.14318579304427
## FBgn0000017 4352.55356876647   -0.23991894353759 0.126336905404352
## FBgn0000018  418.61048415965  -0.104673911941353 0.148489059780011
## FBgn0000024   6.406199980976   0.210847791831903 0.689587553167661
## ...                      ...                 ...               ...
## FBgn0261570 3208.38861003698    0.29553288972209 0.127350479276721
## FBgn0261572 6.19718814545467  -0.958822964597639 0.775314665691102
## FBgn0261573 2240.97951122377  0.0127194420456142 0.113299975906513
## FBgn0261574 4857.68037348332  0.0153924162130344 0.192567170474486
## FBgn0261575 10.6825203335563   0.163570514198072  0.93091062692625
##                           stat             pvalue              padj
##                      <numeric>          <numeric>         <numeric>
## FBgn0000008 0.0101750098121193  0.991881656848261 0.997210766670927
## FBgn0000014 -0.231020748579247  0.817298682951798                NA
## FBgn0000017  -1.89904084455535 0.0575591059082212 0.288001711413016
## FBgn0000018 -0.704926760910396  0.480855815353132 0.826833683766379
## FBgn0000024  0.305759277213403  0.759787936488384 0.943501114514855
## ...                        ...                ...               ...
## FBgn0261570   2.32062644287284 0.0203070137750067 0.144240002513884
## FBgn0261572  -1.23668880136939  0.216202637789157 0.607847805203261
## FBgn0261573  0.112263413507779  0.910614550167173  0.98265666676088
## FBgn0261574 0.0799327121809364  0.936290772501257 0.988179230260634
## FBgn0261575  0.175710223373603   0.86052160317937  0.96792800379094
##############################################
# Note that we could have specified the coefficient or contrast we want to build a results table for, using either of the following equivalent commands:
##############################################
res <- results(dds, name="condition_treated_vs_untreated")
res <- results(dds, contrast=c("condition","treated","untreated"))

One exception to the equivalence of these two commands, is that, using contrast will additionally set to 0 the estimated LFC in a comparison of two groups, where all of the counts in the two groups are equal to 0 (while other groups have positive counts). As this may be a desired feature to have the LFC in these cases set to 0, one can use contrast to build these results tables. More information about extracting specific coefficients from a fitted DESeqDataSet object can be found in the help page ?results. The use of the contrast argument is also further discussed below.

  • 存在一种情况,当这两个group中所有count都为0的时候,这两个group的log2fc也需要设为0。

Log fold change shrinkage for visualization and ranking

Shrinkage of effect size (LFC estimates) is useful for visualization and ranking of genes. To shrink the LFC, we pass the dds object to the function lfcShrink. Below we specify to use the apeglm method for effect size shrinkage (Zhu, Ibrahim, and Love 2018), which improves on the previous estimator.

We provide the dds object and the name or number of the coefficient we want to shrink, where the number refers to the order of the coefficient as it appears in resultsNames(dds).

刚开始理解不了log2FC收缩,看标题意思是可视化过程会变得好看,看了徐洲更老师的简书稍微理解了一点:

正常的log2FC MA plot

收缩后的log2FC

效果显而易见了。

其他参数:

Using parallelization

The above steps should take less than 30 seconds for most analyses. For experiments with complex designs and many samples (e.g. dozens of coefficients, ~100s of samples), one can take advantage of parallelized computation. Parallelizing DESeq, results, and lfcShrink can be easily accomplished by loading the BiocParallel package, and then setting the following arguments: parallel=TRUE and BPPARAM=MulticoreParam(4), for example, splitting the job over 4 cores. Note that results for coefficients or contrasts listed in resultsNames(dds) is fast and will not need parallelization. As an alternative to BPPARAM, one can register cores at the beginning of an analysis, and then just specify parallel=TRUE to the functions when called.

library("BiocParallel")
register(MulticoreParam(4))
  • 多线程并行, 利用BiocParallel 开始的时候将job拆分给4个线程,会加快计算速度。
  • 主要就是用register(MulticoreParam(4))

p-values and adjusted p-values

# We can order our results table by the smallest p value:

resOrdered <- res[order(res$pvalue),]

# We can summarize some basic tallies using the summary function.

summary(res)
## 
## out of 9921 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 518, 5.2%
## LFC < 0 (down)     : 536, 5.4%
## outliers [1]       : 1, 0.01%
## low counts [2]     : 1539, 16%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
How many adjusted p-values were less than 0.1?

sum(res$padj < 0.1, na.rm=TRUE)
## [1] 1054
## The results function contains a number of arguments to customize the results table which is generated. You can read about these arguments by looking up ?results. Note that the results function automatically performs independent filtering based on the mean of normalized counts for each gene, optimizing the number of genes which will have an adjusted p value below a given FDR cutoff, alpha. Independent filtering is further discussed below. By default the argument alpha is set to 0.1
## . If the adjusted p value cutoff will be a value other than 0.1
## , alpha should be set to that value:

res05 <- results(dds, alpha=0.05)
summary(res05)
## 
## out of 9921 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 407, 4.1%
## LFC < 0 (down)     : 431, 4.3%
## outliers [1]       : 1, 0.01%
## low counts [2]     : 1347, 14%
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
sum(res05$padj < 0.05, na.rm=TRUE)
## [1] 838

Independent hypothesis weighting

A generalization of the idea of p value filtering is to weight hypotheses to optimize power. A Bioconductor package, IHW, is available that implements the method of Independent Hypothesis Weighting (Ignatiadis et al. 2016). Here we show the use of IHW for p value adjustment of DESeq2 results. For more details, please see the vignette of the IHW package. The IHW result object is stored in the metadata.

Note: If the results of independent hypothesis weighting are used in published research, please cite:

Ignatiadis, N., Klaus, B., Zaugg, J.B., Huber, W. (2016) Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods, 13:7. 10.1038/nmeth.3885

  • 直译过来是独立假设权重,数学渣渣表示根本不知道这是个什么鬼,下面的解释是对p值得一个筛选,这个筛选基于一个优化的权重值。(还是听不懂,😭)
  • 太费脑经不打算搞懂了,反正姑且认为是一个和adjusted p value类似的东西吧...
  • 需要使用到一个叫IHW包,用法如下:
  • 如果使用了这个参数,需要引用上面的文献。
library("IHW")
resIHW <- results(dds, filterFun=ihw)
summary(resIHW)
## 
## out of 9921 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 507, 5.1%
## LFC < 0 (down)     : 545, 5.5%
## outliers [1]       : 1, 0.01%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting
sum(resIHW$padj < 0.1, na.rm=TRUE)
## [1] 1052
metadata(resIHW)$ihwResult
## ihwResult object with 9921 hypothesis tests 
## Nominal FDR control level: 0.1 
## Split into 6 bins, based on an ordinal covariate

发现和上面P value < 0.05的结果相比,差异的基因更多了。

下一部分更新 Exploring and exporting results

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,684评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 87,143评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 151,214评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,788评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,796评论 5 368
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,665评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,027评论 3 399
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,679评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 41,346评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,664评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,766评论 1 331
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,412评论 4 321
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 39,015评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,974评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,203评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 45,073评论 2 350
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,501评论 2 343

推荐阅读更多精彩内容