Seurat v4 WNN: how to run it, how it works, and softmax

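Seurat4 has released a beta version, and the two biggest changes are:
(1) Multimodal analysis with WNN (weighted nearest neighbor), which makes it possible to analyze single-cell transcriptomes jointly with protein, ATAC, spatial transcriptomics, and other data.
(2) Rapid mapping of query datasets to references. This is what we usually call mapping: with a good reference dataset, mapping can accomplish many things that unsupervised analysis cannot, such as cell annotation and clustering.
The official site is here: Seurat4
Here we mainly do our homework on the analysis steps involved.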

What is WNN?

First, let's look at multimodal analysis. "Multimodal" refers to multiple data types measured in the same cell, for example genomic, transcriptomic, proteomic, and ATAC data from the same cell, and multimodal analysis means analyzing these data types jointly. Multimodal analysis "represents a new and exciting frontier for single-cell genomics". All of us now have plenty of single-cell data, and even the multimodal data published in papers can be reused; analyzing such data jointly is where WNN comes in.
WNN (weighted nearest neighbor analysis) is "an unsupervised strategy to learn the information content of each modality in each cell, and to define cellular state based on a weighted combination of both modalities." In other words, a weight is computed for each data type and used to combine the modalities. The details are in the paper's algorithm, which we go through here.
The paper is here: Integrated analysis of multimodal single-cell data
Let's distill the key points:
First, what WNN is for. The paper describes it as "an analytical framework to integrate multiple data types measured within a cell, and to obtain a joint definition of cellular state". That is, cell types and states are defined from a joint analysis of the multimodal data. "Our approach is based on an unsupervised strategy to learn cell-specific modality 'weights', which reflect the information content for each modality, and determine its relative importance in downstream analyses."
WNN is built in four steps:
(1) Constructing independent k-nearest neighbor (KNN) graphs for both modalities
(2) Performing within and across-modality prediction
(3) Calculating cell-specific modality weights
(4) Calculating a WNN graph
Let's go through each step in detail:

Step 1: Constructing independent k-nearest neighbor (KNN) graphs for both modalities

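KNN has been covered several times before, so I won't repeat it here; see my earlier article on common machine learning algorithms, 机器学习的常用算法（生信必备）. One thing to watch is the choice of k. The paper describes the procedure for each data type:
1. For scRNA-seq data: "We analyze scRNA-seq data using standard pipelines in Seurat which include normalization, feature selection, and dimensional reduction with PCA. We then construct a KNN graph after dimensional reduction." This is essentially the standard Seurat workflow.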
2. For single-cell protein expression: "We analyze single-cell protein data (representing the quantification of antibody-derived tags (ADTs) in CITE-seq or ASAP-seq data) using a similar workflow to scRNA-seq. We normalize protein expression levels within a cell using the centered-log ratio (CLR) transform, followed by dimensional reduction with PCA, and subsequently construct a KNN graph. Unless otherwise specified, we do not perform feature selection on protein data, and use all measured proteins during dimensional reduction." As you can see, no feature selection is performed on the protein data; it goes straight to dimensional reduction and the KNN graph.
3. For single-cell ATAC data: "We analyze single-cell ATAC-seq data using our previously described workflow, as implemented in the Signac package. We reduced the dimensionality of the scATAC-seq data by performing latent semantic indexing (LSI) on the scATAC-seq peak matrix, as suggested by Cusanovich and Hill et al. We first computed the term frequency-inverse document frequency (TF-IDF) of the peak matrix by dividing the accessibility of each peak in each cell by the total accessibility in the cell (the “term frequency”), and multiplied this by the inverse accessibility of the peak in the cell population. This step ‘upweights’ the contribution of highly variable peaks and downweights peaks that are accessible in all cells. We then multiplied these values by 10,000 and log-transformed this TF-IDF matrix, adding a pseudocount of 1 to avoid computing the log of 0. We decomposed the TF-IDF matrix via SVD to return LSI components, and scaled LSI loadings for each LSI component to mean 0 and standard deviation 1." If you have experience with ATAC data, feel free to share more.
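To make the ATAC description above more concrete, here is a minimal base-R sketch of the TF-IDF, log transform, SVD, and scaling steps, followed by a brute-force KNN search on the reduced space. The toy count matrix, the exact form of the IDF term (number of cells divided by the peak's total accessibility), the number of components, and the helper name knn_indices are all assumptions made for illustration. In a real analysis Signac's RunTFIDF() and RunSVD() do this work, and their details may differ from this sketch.

```r
set.seed(1)
# Toy peak-by-cell count matrix (40 peaks x 12 cells); the values are made up.
counts <- matrix(rpois(40 * 12, lambda = 2), nrow = 40, ncol = 12)
counts <- counts[rowSums(counts) > 0, , drop = FALSE]  # drop peaks never observed
stopifnot(all(colSums(counts) > 0))                    # every cell needs some signal

# Term frequency: each peak's count divided by the cell's total accessibility.
tf <- sweep(counts, 2, colSums(counts), "/")
# Inverse document frequency (assumed form): cells / total accessibility of the peak.
idf <- ncol(counts) / rowSums(counts)
# TF-IDF, multiplied by 10,000 and log-transformed with a pseudocount of 1.
tfidf <- log1p(tf * idf * 1e4)   # idf is recycled down the rows, one value per peak

# SVD gives the LSI components; each component is scaled to mean 0 and sd 1.
n_comp <- 5
lsi <- scale(svd(tfidf, nu = 0, nv = n_comp)$v)   # cells x components

# k nearest neighbors of every cell by Euclidean distance in the reduced space.
knn_indices <- function(embedding, k) {
  d <- as.matrix(dist(embedding))
  diag(d) <- Inf                      # a cell is not its own neighbor here
  do.call(rbind, lapply(seq_len(nrow(d)), function(i) order(d[i, ])[seq_len(k)]))
}
nn <- knn_indices(lsi, k = 4)         # 12 x 4 matrix of neighbor indices
```

The same knn_indices helper would apply unchanged to an RNA or protein PCA embedding, which is all that step 1 needs from each modality.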

Step 2: Performing within and across-modality prediction

This step is the key one. Let's look at how it works, using the data from the paper as the example:
Suppose we have a CITE-seq dataset where two modalities, RNA and protein, are measured in each single cell. From the previous step, we define the following:

23.png

These are some basic definitions you need to know:
We average the low-dimensional profiles of each neighbor set, which represents a prediction for the molecular contents for cell 𝑖 based on its local neighborhoods. We perform both within-modality and across-modality prediction:
Within-modality prediction (RNA):

1.png

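As you can see, this value is computed mainly by averaging the low-dimensional profiles of a cell's neighbors; the resulting vector is the prediction of that cell's molecular contents.
The same applies to the protein data: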

1.png

Note that this is still prediction within a modality. Next comes prediction across data types:

MF8}BZXLNNRNCL$@FV$9N40.png

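Looking at the formulas here, we can see the following:
The key to cross-modality prediction is this. To predict a cell's transcriptome from the protein data, take the target cell's neighbors in the protein data, look up those same cells in the transcriptome data, and average their profiles; that average is the cell's predicted transcriptome. Predicting protein levels from RNA works the same way. So every cell ends up with a protein profile predicted from its RNA neighbors and an RNA profile predicted from its protein neighbors. Once the predictions are done, we compute the weights.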
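The within-modality and cross-modality predictions described above can be sketched in a few lines of base R. The data are random stand-ins for the low-dimensional RNA and protein profiles, and the helper names knn_indices and neighbor_mean are hypothetical; this illustrates the averaging idea, it is not Seurat's implementation.

```r
set.seed(42)
n <- 30; k <- 5
rna  <- matrix(rnorm(n * 4), nrow = n)   # hypothetical low-dimensional RNA profiles
prot <- matrix(rnorm(n * 3), nrow = n)   # hypothetical low-dimensional protein profiles

# k nearest neighbors of every cell (by Euclidean distance, excluding the cell itself).
knn_indices <- function(embedding, k) {
  d <- as.matrix(dist(embedding))
  diag(d) <- Inf
  do.call(rbind, lapply(seq_len(nrow(d)), function(i) order(d[i, ])[seq_len(k)]))
}

# Average the rows of `profile` over each cell's neighbor set.
neighbor_mean <- function(profile, nn) {
  do.call(rbind, lapply(seq_len(nrow(nn)), function(i) {
    colMeans(profile[nn[i, ], , drop = FALSE])
  }))
}

nn_rna  <- knn_indices(rna, k)    # neighbors defined in RNA space
nn_prot <- knn_indices(prot, k)   # neighbors defined in protein space

# Within-modality: predict a profile from neighbors found in the same modality.
rna_from_rna   <- neighbor_mean(rna,  nn_rna)
prot_from_prot <- neighbor_mean(prot, nn_prot)

# Cross-modality: predict a profile from neighbors found in the other modality.
rna_from_prot <- neighbor_mean(rna,  nn_prot)
prot_from_rna <- neighbor_mean(prot, nn_rna)
```

The only difference between the two kinds of prediction is which neighbor matrix is passed in: the profile being averaged stays the same, the neighbor set changes.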

Step 3: Calculating cell-specific modality weights

23.png

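The thing to note here is the (Euclidean) distance between the predicted profile and the measured one: the smaller this distance, the closer the prediction is to what sequencing actually measured. The formula converts this distance into an affinity. By the formula, the affinity depends strongly on the gap between prediction and measurement: the larger the gap, the smaller the affinity.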
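As a rough illustration of how a distance becomes an affinity, the sketch below uses an exponential kernel of the form exp(-max(d - d_nn, 0) / (sigma - d_nn)), where d is the distance between the measured profile and a prediction, d_nn is the distance from the cell to its nearest neighbor, and sigma is the cell-specific bandwidth discussed next. This form is my reading of the paper's Methods and should be checked against the original formula; the function name affinity and the numbers are made up.

```r
# d_pred: Euclidean distance between the measured profile and a prediction
# d_nn:   distance from the cell to its nearest neighbor in that modality
# sigma:  cell-specific kernel bandwidth (assumed to be larger than d_nn)
affinity <- function(d_pred, d_nn, sigma) {
  stopifnot(sigma > d_nn)
  exp(-pmax(d_pred - d_nn, 0) / (sigma - d_nn))
}

# Hypothetical distances for one cell: the affinity is 1 for predictions at least
# as close as the nearest neighbor and decays as the prediction moves away.
d_pred <- c(0.5, 1, 2, 4)
theta  <- affinity(d_pred, d_nn = 1, sigma = 3)
```

Whatever the exact kernel, the property the text relies on is monotonicity: a larger gap between prediction and measurement never gives a larger affinity.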

22.png

"Our approach is inspired by the concept of large margin nearest neighbors, which aims to identify kernel bandwidths that separate data points in the same class from those in different classes, even if the classes are closely related. In the context of unsupervised single-cell analysis (where the data points are unlabeled), we aim to identify a kernel bandwidth that groups together cells in the same state, yet divides cells that originate from closely related (but different) states."
How the bandwidth is determined deserves a closer look; here is how it is used:
Recent work has clearly demonstrated that KNN-graphs are prone to the formation of spurious edges, which represent links between cells that share some similarity molecular profiles, but are not in a matched molecular state. However, it is possible to identify these spurious edges through the use of the Jaccard metric. This identifies the number of shared nearest neighbors between two cells, thereby exploiting the local density of each data point to separate well-supported from spurious edges.
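One concept to note here is the Jaccard metric.
It is used to compare the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the more similar the sets. It is computed as shown in the figure below: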

index.png

This is something we need to know.
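For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal R sketch, applied to two made-up neighbor sets (the function name jaccard is just for illustration):

```r
# Jaccard index: shared elements divided by all distinct elements.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

neighbors_cell_1 <- c(1, 2, 3, 4)   # hypothetical neighbor set of one cell
neighbors_cell_2 <- c(3, 4, 5, 6)   # hypothetical neighbor set of another cell
jaccard(neighbors_cell_1, neighbors_cell_2)   # 2 shared out of 6 distinct = 1/3
```

Two cells with identical neighbor sets score 1, and two cells with no shared neighbors score 0, which is why a low but non-zero value flags cells that are related but probably not in the same state.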
Next comes the calculation of the bandwidth:
"For each cell 𝑖, we therefore aim to identify the 20 cells in the dataset with the lowest non-zero Jaccard similarity. We expect that these represent cells that exhibit some similarity with cell 𝑖, but are unlikely to reside in the same molecular state. If more than 20 cells share the same Jaccard value, we select the 20 with the furthest euclidean distance to cell 𝑖. We take the average of the Euclidean distances from cell 𝑖 to the 20 selected cells, and set this as the cell-specific kernel bandwidth."
In other words, the bandwidth of a cell is defined as the average Euclidean distance from that cell to these 20 selected cells.
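The bandwidth rule quoted above can be sketched as follows. This is an illustration on random data, under several assumptions of mine: each neighbor set contains the cell itself plus its k nearest cells, ties in Jaccard value are broken by taking the most distant cells first, and if fewer than 20 cells have a non-zero Jaccard value all of them are used. None of these details is confirmed by the text, so treat the code as a sketch rather than the authors' procedure.

```r
set.seed(7)
n <- 60; k <- 10; n_far <- 20
emb <- matrix(rnorm(n * 3), nrow = n)   # hypothetical low-dimensional profiles
d <- as.matrix(dist(emb))               # pairwise Euclidean distances

# Neighbor sets as a 0/1 matrix: the cell itself plus its k nearest cells (assumption).
adj <- matrix(0, n, n)
for (i in seq_len(n)) adj[i, order(d[i, ])[seq_len(k + 1)]] <- 1

# Jaccard similarity between every pair of neighbor sets (all sets have size k + 1).
shared <- adj %*% t(adj)
jac <- shared / (2 * (k + 1) - shared)

bandwidth <- sapply(seq_len(n), function(i) {
  cand <- setdiff(which(jac[i, ] > 0), i)        # non-zero Jaccard, not the cell itself
  ord <- order(jac[i, cand], -d[i, cand])        # lowest Jaccard first, furthest first on ties
  picked <- cand[ord[seq_len(min(n_far, length(cand)))]]
  mean(d[i, picked])                             # average distance = kernel bandwidth
})
```

Because the selected cells are weakly related to cell i rather than its closest neighbors, the resulting bandwidth sits at the scale where "same state" is expected to give way to "related but different state".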

Step 3 (continued): Calculating cell-specific modality weights

Computing the modality weights: yet another pile of formulas.

367Q7NHPV)SD2~XV%W[YS]A.png

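In short, this computes a ratio that expresses how well the prediction matches the measured value; the larger the ratio, the better the prediction. This yields two ratio values, and the authors apply a softmax transformation to them to obtain the affinity weights.
Here we need to know what the softmax transformation is.
See the figure below: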

111.png

Looks complicated, doesn't it? Math is the one thing you cannot bluff: if you can't do it, you can't do it.
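The softmax transformation itself is short: exponentiate each value and divide by the sum of the exponentials, so the outputs are positive and add up to 1. The sketch below applies it to two per-modality ratios for a single cell. The affinity values are invented, and the way the ratios are formed (within-modality affinity divided by cross-modality affinity plus a small constant eps) is my reading of the paper and should be verified against the original formulas.

```r
# Softmax: positive outputs that sum to 1; subtracting the max avoids overflow.
softmax <- function(x) {
  z <- exp(x - max(x))
  z / sum(z)
}

# Hypothetical affinities for one cell (values are made up).
theta_rna_within  <- 0.9   # RNA profile vs. prediction from RNA neighbors
theta_rna_cross   <- 0.3   # RNA profile vs. prediction from protein neighbors
theta_prot_within <- 0.6   # protein profile vs. prediction from protein neighbors
theta_prot_cross  <- 0.5   # protein profile vs. prediction from RNA neighbors
eps <- 1e-4                # small constant to avoid division by zero (assumed value)

# One ratio per modality: how much better the modality's own neighbors predict it.
s_rna  <- theta_rna_within  / (theta_rna_cross  + eps)
s_prot <- theta_prot_within / (theta_prot_cross + eps)

# Softmax turns the two ratios into modality weights for this cell.
w <- softmax(c(RNA = s_rna, protein = s_prot))
```

In this made-up example the RNA neighbors predict the cell far better than the protein neighbors do, so the RNA modality receives the larger weight.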

Step 4: Calculating a WNN graph

In step 3 we obtained the weights between the modalities; the next task is to build the WNN graph.
We leverage the cell-specific modality weights calculated above to define a new similarity metric between
any two cells, which reflects a weighted combination of RNA and protein affinities. For two cells 𝑖 and cell
𝑗, we define their weighted similarity as:

22.png

Similarity is redefined between every pair of cells; from this point on the analysis uses the modalities jointly.
We then construct a WNN graph, defined as a KNN graph constructed using this weighted similarity metric.
For each cell, we consider the set:

图片.png

and identify the k-most similar cells within this set based on the weighted similarity metric as weighted nearest neighbors.
At this point the new, fused matrix is in place and downstream analysis can begin; this is how the multimodal analysis is achieved.
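To tie step 4 together, here is a small base-R sketch of the weighted similarity and the resulting neighbor lists. The affinity matrices and weights are random placeholders, the weighted similarity is taken to be w_rna(i) * theta_rna(i, j) + w_protein(i) * theta_protein(i, j) as the quoted description suggests, and the candidate set for each cell is simplified to "all other cells" because the set used in the paper is given in a figure not reproduced here. All variable names are hypothetical.

```r
set.seed(3)
n <- 8; k <- 3

# Placeholder symmetric affinity matrices for the two modalities.
random_affinity <- function(n) {
  m <- matrix(runif(n * n), n, n)
  m <- (m + t(m)) / 2
  diag(m) <- 1
  m
}
theta_rna  <- random_affinity(n)
theta_prot <- random_affinity(n)

# Placeholder cell-specific modality weights that sum to 1 for every cell.
w_rna  <- runif(n)
w_prot <- 1 - w_rna

# Weighted similarity: row i is scaled by the weights of cell i.
theta_w <- w_rna * theta_rna + w_prot * theta_prot

# Weighted nearest neighbors: the k most similar other cells for every cell.
wnn <- do.call(rbind, lapply(seq_len(n), function(i) {
  s <- theta_w[i, ]
  s[i] <- -Inf                       # exclude the cell itself
  order(s, decreasing = TRUE)[seq_len(k)]
}))
```

Note that the weighted similarity is not symmetric: the similarity of cell i to cell j uses the weights of cell i, so a cell that trusts its protein data more will pick neighbors mostly by protein affinity.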

As for the code, it is simple:
it is a single function in the Seurat package

bm <- FindMultiModalNeighbors(
  bm, reduction.list = list("pca", "apca"), 
  dims.list = list(1:30, 1:18), modality.weight.name = "RNA.weight"
)
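For context, the call above expects that bm already holds two reductions, "pca" for RNA and "apca" for the protein (ADT) assay. Once it has run, the usual next steps are a UMAP and a clustering computed from the weighted graphs. The sketch below assumes the default graph names "weighted.nn" and "wsnn" as used in the official WNN vignette, and the resolution and algorithm values are only examples; adjust them if your Seurat version or settings differ.

```r
library(Seurat)

# Assumes `bm` is the Seurat object returned by FindMultiModalNeighbors() above.
# UMAP computed from the weighted nearest-neighbor graph.
bm <- RunUMAP(bm, nn.name = "weighted.nn",
              reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")

# Clustering on the weighted shared-nearest-neighbor graph.
bm <- FindClusters(bm, graph.name = "wsnn", algorithm = 3,
                   resolution = 2, verbose = FALSE)

# Per-cell RNA modality weight, stored in the metadata under the name given above.
summary(bm$RNA.weight)
```

Looking at the distribution of RNA.weight per cluster is a quick way to see which cell populations are defined mostly by RNA and which mostly by protein.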


Last edited: 2020.11.11 15:55:20

Reposting is not permitted. To repost, contact the author via Jianshu message or in the comments.
