Hi everyone. Today let's dig into the dropout phenomenon that pervades 10X single-cell and 10X spatial transcriptomics data, how it affects our analyses, and how the method in this paper, scDCC, works around it. The paper is "Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data", published in Nature Communications in March 2021 by a Chinese group, which is no small feat. For background on dropout, see my earlier post "Dropout in deep learning explained (10X single cell and 10X spatial transcriptomics)" for a quick primer. With that in hand, let's read the paper in depth and see how it tackles the problem. (When will we get to write our own algorithms, instead of reading and borrowing everyone else's?)
Same routine as always: first the paper, then example code.
Abstract
Clustering is a critical step in single cell-based studies. Most existing methods support unsupervised clustering without the a priori exploitation of any domain knowledge. (I suspect that describes most of us: once we have the matrix, we go straight to dimension reduction and clustering with Seurat, with barely any prior knowledge involved.) When confronted by the high dimensionality and pervasive dropout events of scRNA-seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters (sound familiar? some clusters have few or even no differentially expressed genes and simply cannot be annotated), which complicates cell type assignment. In such cases, the only recourse is for the user to manually and repeatedly tweak clustering parameters until acceptable clusters are found. (Exactly: blind parameter tuning. Perhaps that is the biggest gulf between research services and clinical testing.) Consequently, the path to obtaining biologically meaningful clusters can be ad hoc and laborious. Here we report a principled clustering method named scDCC, that integrates domain knowledge into the clustering step. (Using prior knowledge during clustering, which is genuinely hard.) Experiments on various scRNA-seq datasets from thousands to tens of thousands of cells show that scDCC can significantly improve clustering performance, facilitating the interpretability of clusters and downstream analyses, such as cell type assignment. (The last sentence is boilerplate; every paper praises itself, or it wouldn't get published 😄.)
Introduction
Let me distill this part.
The common pipeline today is dimension reduction with PCA, t-SNE, or UMAP, followed by k-means or hierarchical clustering and visualization, including SC3 (spectral clustering), pcaReduce (PCA + k-means + hierarchical), TSCAN (PCA + Gaussian mixture model) and mpath (hierarchical), to name a few. (There really are a lot of them; I explored PCA in depth in an earlier post, have a look.) Because of the sparsity of single-cell data (meaning dropout and the high gene-level variability), these traditional clustering methods tend to produce suboptimal results.
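As a reference point, here is a minimal sketch of that classic pipeline (PCA for dimension reduction, then k-means in PC space); the matrix, component count, and cluster number are illustrative placeholders, not any particular tool's defaults.

```python
# Classic pipeline sketch: PCA + k-means on a cells-by-genes matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.random.rand(1000, 2000)                         # toy cells x genes matrix
pcs = PCA(n_components=50).fit_transform(X)            # linear dimension reduction
labels = KMeans(n_clusters=8, n_init=10).fit_predict(pcs)  # cluster in PC space
```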
Recently, various clustering methods have been proposed to overcome the challenges in scRNA-seq data analysis.
(1) Shared nearest neighbor (SNN)-Cliq combines a quasi-clique-based clustering algorithm with the SNN-based similarity measure to automatically identify clusters in high-dimensional, highly variable scRNA-seq data. (SNN is the same similarity measure Seurat's clustering builds on.)
(2) DendroSplit applies "split" and "merge" operations to a dendrogram obtained by hierarchical clustering, which iteratively groups cells by their pairwise distances (computed on selected genes), to uncover multiple levels of biologically meaningful populations with interpretable hyperparameters. (Hierarchical clustering is actually rarely used in practice.)
(3) If the dropout probability P(u) is a decreasing function of the gene expression u, CIDR uses a nonlinear least-squares regression to empirically estimate P(u) and imputes the gene expressions with a weighted average to alleviate the impact of dropouts. (Personally, I'm not convinced this is reliable; see the sketch after this list.)
(4) Clustering analysis is performed on the first few principal coordinates, obtained through principal coordinate analysis (PCoA) on the imputed expression matrix. (This is what everyone does; the only real difference is how many components to keep.)
(5) SIMLR and MPSSC are both multiple kernel-based spectral clustering methods. Considering the complexities of the scRNA-seq data, multiple kernel functions can help to learn robust similarity measures that correspond to different informative representations of the data. (Frankly, I had never heard of these methods 😂.) However, spectral clustering relies on the full graph Laplacian matrix, which is prohibitively expensive to compute and store. (A glaring drawback; no wonder I'd never heard of them.)
(6) The high complexity and limited scalability generally impede applying these methods to large scRNA-seq datasets. (The peculiarities of single-cell data really do rule out simply reusing the older generation of methods.)
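Back to item (3): here is a hedged sketch of the CIDR idea, not the actual CIDR code. Fit a decreasing logistic curve for the dropout probability P(u) as a function of expression u by nonlinear least squares, then impute a value as a weighted average of the observed value and similar cells' values. All names and the toy data are my own illustrations.

```python
import numpy as np
from scipy.optimize import curve_fit

def dropout_prob(u, a, b):
    # decreasing logistic: near 1 at low expression, near 0 at high expression
    return 1.0 / (1.0 + np.exp(a * (u - b)))

# toy data: empirical dropout rates observed at a grid of expression levels
u_grid = np.linspace(0.0, 10.0, 50)
p_emp = 1.0 / (1.0 + np.exp(1.5 * (u_grid - 2.0))) + 0.05 * np.random.rand(50)
(a_hat, b_hat), _ = curve_fit(dropout_prob, u_grid, p_emp, p0=(1.0, 1.0))

def impute(u_cell, u_neighbors):
    # trust the observed value in proportion to 1 - P(u); otherwise borrow
    # from similar cells' average
    w = 1.0 - dropout_prob(u_cell, a_hat, b_hat)
    return w * u_cell + (1.0 - w) * np.mean(u_neighbors)
```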
The model
The large numbers of cells profiled by scRNA-seq give researchers a unique opportunity to apply deep learning methods to model noisy and complex scRNA-seq data. (That is exactly my career ambition.)
(1) scScope and DCA (Deep Count Autoencoder) apply regular autoencoders to denoise single-cell gene expression data and impute the missing values. (Frankly, I hadn't heard of these either.) In autoencoders, the low-dimensional bottleneck layer enforces the encoder to learn only the essential latent representations, and the decoding procedure ignores non-essential sources of variation in the expression data. (Rather technical; look it up if you're interested. A sketch follows after this list.)
(2) Compared to scScope, DCA explicitly models the overdispersion and zero-inflation with a zero-inflated negative binomial (ZINB) model-based loss function and learns gene-specific parameters (mean, dispersion and dropout probability) from the scRNA-seq data. (The zero-inflated negative binomial: I'm not sure how familiar everyone is with it, but scanpy users will have come across it. It is sketched in code after this list.)
(3) SCVI and SCVIS are variational autoencoders (VAE) focusing on dimension reduction of scRNA-seq data. (Never heard of these either; clearly I still have a lot to learn.) Unlike an autoencoder, a variational autoencoder assumes that the latent representations learnt by the encoder follow a predefined distribution (typically a Gaussian). SCVIS uses Student's t-distributions to replace the regular MSE-loss (mean square error) VAE, while SCVI applies the ZINB-loss VAE to characterize scRNA-seq data. (Each distribution has its strengths.) A variational autoencoder is a deep generative model, but the assumption of latent representations following a Gaussian distribution might introduce an over-regularization problem and compromise its performance. (The drawback is pretty clear; no wonder these aren't widely used 😄. A minimal VAE sketch also follows after this list.)
(4) More recently, Tian et al. developed a ZINB model-based deep clustering method (scDeepCluster) and showed that it could effectively characterize and cluster the discrete, over-dispersed and zero-inflated scRNA-seq count data. (Citing your own earlier paper, nicely done; and the ZINB really is the distribution most commonly used for single-cell counts.) scDeepCluster combines the ZINB model-based autoencoder with deep embedding clustering, which optimizes the latent feature learning and clustering simultaneously to achieve better clustering results. (The authors calling it good doesn't settle it; we readers get to judge.)
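To make items (1) and (2) concrete, here is a hedged sketch of a DCA-style ZINB autoencoder: a bottleneck autoencoder whose decoder emits three gene-wise parameters (mean, dispersion, dropout probability), trained with a ZINB negative log-likelihood in the standard textbook parameterization. Layer sizes are illustrative, not the published architectures.

```python
import torch
import torch.nn as nn

class ZINBAutoencoder(nn.Module):
    def __init__(self, n_genes=2000, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(),
                                     nn.Linear(256, latent))   # bottleneck
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU())
        self.mu = nn.Sequential(nn.Linear(256, n_genes), nn.Softplus())
        self.theta = nn.Sequential(nn.Linear(256, n_genes), nn.Softplus())
        self.pi = nn.Sequential(nn.Linear(256, n_genes), nn.Sigmoid())

    def forward(self, x):
        h = self.decoder(self.encoder(x))
        return self.mu(h), self.theta(h), self.pi(h)

def zinb_nll(x, mu, theta, pi, eps=1e-8):
    # negative binomial log-likelihood (mean/dispersion parameterization)
    log_nb = (torch.lgamma(x + theta) - torch.lgamma(theta)
              - torch.lgamma(x + 1)
              + theta * torch.log(theta / (theta + mu) + eps)
              + x * torch.log(mu / (theta + mu) + eps))
    nb_zero = (theta / (theta + mu)) ** theta            # NB probability of a zero
    log_zero = torch.log(pi + (1 - pi) * nb_zero + eps)  # inflated-zero branch
    log_nonzero = torch.log(1 - pi + eps) + log_nb
    return -torch.where(x < eps, log_zero, log_nonzero).mean()
```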
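And for contrast with the plain autoencoder, a minimal VAE encoder sketch from item (3): the encoder outputs a mean and log-variance, the latent code is sampled via the reparameterization trick, and a KL term pulls q(z|x) toward the N(0, I) prior, which is exactly the Gaussian assumption the paragraph above criticizes. Sizes are illustrative, not SCVI's or SCVIS's.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, n_genes=2000, latent=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)
        self.logvar = nn.Linear(256, latent)

    def forward(self, x):
        h = self.hidden(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        # KL divergence to the N(0, I) prior, added to the reconstruction loss
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
        return z, kl
```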
The downstream part
Much of the downstream biological investigation relies on initial clustering results. Although clustering aims to explore and uncover new information, biologists expect to see some meaningful clusters that are consistent with their prior knowledge. (Classic results-driven thinking, uncomfortably close to fabrication.) In other words, totally exotic clustering with poor biological interpretability is puzzling and generally not desired by biologists. (Still, we have to stay grounded in objective facts.) For a clustering algorithm, it is good to accommodate biological interpretability while minimizing clustering loss from the computational side. (That is the algorithm's goal.) However, existing algorithms only support unsupervised clustering (supervised is not necessarily better than unsupervised), and the results sometimes contradict prior knowledge. If a method initially fails to find a meaningful solution, the only recourse may be for the user to manually and repeatedly tweak clustering parameters until sufficiently good clusters are found. (Tread carefully here; don't just follow the crowd.)
We note that prior knowledge has become widely available in many cases (though it isn't necessarily all correct). Quite a few cell type-specific signature sets have been published. (Every sample is different; these cannot be applied uniformly as a one-size-fits-all solution.) Ignoring prior information may lead to suboptimal, unexpected, and even illogical clustering results. (I don't fully agree: improving the algorithm is fine, but too much human input produces equally bad results.) The paper then lists several cell type annotation tools. Honestly, cell identity is a dynamic process, and recognizing it purely by software is unrealistic; prior knowledge does not fit every situation, since tissue, origin, strain and treatment all change the cells. So I personally disagree with the viewpoint here.
However, there are several limitations of these methods.
(1) First, they are developed in the context of marker genes and lack the flexibility to integrate other kinds of prior information. (Human input really must not dominate.)
(2) Second, they are only applicable to scenarios where cell types are predefined and well-studied marker genes exist. (This isn't quite right either.)
Poorly understood cell types would be invisible to these methods. Finally, they both ignore pervasive dropout events, a well-known problem for scRNA-seq data.
In this article, we are interested in integrating prior information into the modeling process to guide our deep learning model to simultaneously learn meaningful and desired latent representations and clusters (combining prior knowledge with machine learning; once human input enters the picture, caution is warranted). The idea is to convert (partial) prior knowledge into soft pairwise constraints and add them as additional terms into the loss function for optimization (manually injecting external information). This falls into the semi-supervised category; the software is scDCC.
scDCC encodes prior knowledge into constraint information, which is integrated into the clustering procedure via a novel loss function (sketched below). Naturally, the paper goes on to claim the method works well; we should read that critically. (The algorithm itself is covered under Methods.)
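Here is a hedged sketch of soft pairwise-constraint losses of the kind described above, written as penalties on the soft cluster assignments q (a cells-by-clusters matrix of probabilities): must-link pairs should land in the same cluster, cannot-link pairs should not. This is the general flavor, not the authors' code; see the paper's Methods for scDCC's exact formulation.

```python
import torch

def ml_loss(q, ml_pairs, eps=1e-8):
    # ml_pairs: LongTensor of shape (n_pairs, 2) with must-link cell indices
    i, j = ml_pairs[:, 0], ml_pairs[:, 1]
    same = (q[i] * q[j]).sum(dim=1)   # probability that i and j share a cluster
    return -torch.log(same + eps).mean()

def cl_loss(q, cl_pairs, eps=1e-8):
    i, j = cl_pairs[:, 0], cl_pairs[:, 1]
    same = (q[i] * q[j]).sum(dim=1)
    return -torch.log(1 - same + eps).mean()

# The total objective adds these soft terms to the unsupervised losses, e.g.
# total = zinb_loss + clustering_loss + ml_loss(q, ml) + cl_loss(q, cl)
```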
Result 1: Pairwise constraints.
Pairwise constraints mainly focus on the together-or-apart guidance defined by prior information and domain knowledge. They enforce small divergence between predefined "similar" samples, while enlarging the difference between "dissimilar" instances. (In plain terms, prior knowledge constrains "distances": similar samples are pulled together, dissimilar ones pushed apart.) Researchers usually encode the together and apart information into must-link (ML) and cannot-link (CL) constraints, respectively (a classification of the information). With the proper setup, pairwise constraints have been proved to be capable of defining any ground-truth partition. (This is textbook machine learning.) In the context of scRNA-seq studies, pairwise constraints can be constructed based on the cell distance computed using marker genes (where do the marker genes come from? Other people's work? That seems shaky), cell sorting using flow cytometry, or other methods depending on the real application scenario.
To evaluate the performance of pairwise constraints, four datasets were used.
We selected 10% of cells with known labels to generate constraints in each dataset and evaluated the performance of scDCC on the remaining 90% of cells. (Frankly, this setup seems questionable to me; a sketch of constraint generation follows.) We show that the prior information encoded as soft constraints could help inform the latent representations of the remaining cells and therefore improve the clustering performance. (I find this part unconvincing.)
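A sketch of how constraints can be generated from a labeled subset, as in the experiment above: sample random pairs among the cells with known labels and emit must-link (same label) or cannot-link (different label) pairs. Names are illustrative, not the paper's code.

```python
import numpy as np

def make_constraints(labels, n_pairs, seed=0):
    rng = np.random.default_rng(seed)
    ml, cl = [], []
    idx = np.arange(len(labels))
    while len(ml) + len(cl) < n_pairs:
        i, j = rng.choice(idx, size=2, replace=False)  # random cell pair
        (ml if labels[i] == labels[j] else cl).append((i, j))
    return np.array(ml), np.array(cl)

labels = np.array([0, 0, 1, 1, 2, 2])   # stand-in for the labeled 10% of cells
ml_pairs, cl_pairs = make_constraints(labels, n_pairs=10)
```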
Three clustering metrics:
(1) Normalized mutual information (NMI), ranging from 0 to 1.
(2) Clustering accuracy (CA), ranging from 0 to 1.
(3) Adjusted Rand index (ARI) (see the Rand index primer below), which, unlike the other two, can be negative.
(A quick primer: the Rand index requires the ground-truth partition C. Let K be the clustering result, a the number of element pairs placed in the same class in both C and K, and b the number of pairs placed in different classes in both C and K; then RI = (a + b) / C(n, 2) for n elements. It measures whether the same objects end up in the same class under the two partitions.)
A larger value indicates better concordance between the predicted labels and ground truth. The number of pairwise constraints fed into the model explicitly controls how much prior information is applied in the clustering process. (A fairly significant limitation. The three metrics are sketched in code below.)
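A sketch of the three metrics with sklearn/scipy: NMI and ARI are built in, while clustering accuracy (CA) needs the best one-to-one matching between predicted and true labels, usually found with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                          # contingency table
    rows, cols = linear_sum_assignment(-cost)    # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])            # same partition, relabeled
print(normalized_mutual_info_score(y_true, y_pred),  # 1.0
      clustering_accuracy(y_true, y_pred),           # 1.0
      adjusted_rand_score(y_true, y_pred))           # 1.0
```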
Now for the paper's experimental results.
Good results, of course; the prior knowledge in the paper was certainly well prepared. For datasets that are difficult to cluster, imposing a small set of pairwise constraints significantly improves the results. With 6000 pairwise constraints, scDCC achieves acceptable performance on all four datasets. (With priors that good, is there anything left to verify? You could just label everything directly.)
A random subset of the corresponding ML (blue lines) and CL (red lines) constraints is also plotted (on the t-SNE).
As shown, the latent representations learned by the ZINB model-based autoencoder are noisy and different labels are mixed. Although the representations from scDeepCluster could separate different clusters, inconsistency against the constraints still exists. Finally, by incorporating the soft constraints into the model training, scDCC was able to precisely separate the clusters, and the results are consistent with both the ML (blue lines) and CL (red lines) constraints. (Their own software performs best, which feels almost tautological, since it is given more information.) Overall, these results show that pairwise constraints can help to learn a better representation during the end-to-end learning procedure and improve clustering performance.
For the randomly selected 2100 cells in each dataset, we observed that scDCC with 0 constraints outperformed most competing scRNA-seq clustering methods (this is the more meaningful comparison), although some strong methods outperformed scDCC with 0 constraints on some datasets, such as SC3 and Seurat on mouse bladder cells. (Seurat's clustering really is quite good.) With constraints added, scDCC performs best, which feels like a bit of a stretch.
The next part is the real crux.
In real applications, we recognize that constraint information may not be 100% accurate. (Even 50% accuracy would be pretty good.) To evaluate the robustness of the proposed method, we applied scDCC to the datasets with 5% and 10% erroneous pairwise constraints. (Let's see what happens when the priors contain some errors.) Naturally the robustness is decent, or this paper wouldn't exist, but once the error rate gets high the method falls apart entirely. Therefore, users should take caution when adding highly erroneous constraints. The other validation results, of course, also look good.
Result 2: Robustness on highly dispersed genes.
Gene filtering is widely applied in many single-cell analysis pipelines. (It is usually the first step of any real analysis.) One typical gene filtering strategy is to filter out low-variable genes and only keep highly dispersed genes (i.e., selecting highly variable genes; a sketch follows). Selecting highly dispersed genes could amplify the differences among cells but lose key information between cell clusters. (Is that actually true???) To evaluate the robustness of scDCC on highly dispersed genes, we conducted experiments on the top 2000 highly dispersed genes of the four datasets and displayed the performances of scDCC and the baseline methods. The results are decent, of course, but not very informative.
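A sketch of the standard highly-variable-gene filter with scanpy, mirroring the top-2000 setup described above; pbmc3k is just a convenient public example dataset, not one of the paper's four.

```python
import scanpy as sc

adata = sc.datasets.pbmc3k()                     # raw counts, cells x genes
sc.pp.normalize_total(adata, target_sum=1e4)     # library-size normalization
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]      # keep only the top 2000 genes
```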
Result 3: Real applications and use cases. (Let's take a look.)
Generating accurate constraints is the key to successfully applying the proposed scDCC algorithm to obtain robust and desired clustering results (so this is the main limiting factor). Two approaches:
(1) Protein marker-based constraints.
(2) Marker gene-based constraints.
Both require manual labeling up front; clearly there is a long road ahead.
Methods
As for the code, it's here: scDCC.
Having read this paper, I can feel life slipping away.
Life is good, and better with you.