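Today we revisit a classic, highly recommended model for training word vectors: GloVe. (Many readers are probably more familiar with word2vec, which we will revisit in a later post. The GloVe paper reports that, when trained on large corpora, GloVe outperforms word2vec.)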

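I. Introduction to the GloVe model

GloVe is built on the word-word co-occurrence matrix. Because this matrix is very sparse, the model is trained only on its nonzero elements.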

Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.

The training method rests on two main ideas:

1. Use global statistics from the co-occurrence matrix

2. Use local context-window features

The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods.

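These two ideas come from LSA and skip-gram respectively. LSA (latent semantic analysis) starts from a term-document co-occurrence matrix and applies a matrix factorization such as SVD to represent words as denser vectors. The skip-gram model instead learns a language model (word prediction) over fixed-size windows within each sentence.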

While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

1) Matrix factorization methods

LSA is a typical matrix factorization method; it builds a term-document co-occurrence matrix.

In LSA, the matrices are of “term-document” type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus.

2) Shallow window-based methods

In the skip-gram and ivLBL models, the objective is to predict a word’s context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context.

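Since GloVe draws on both of the properties above, how does it actually do so?

First, build a word-word co-occurrence matrix $X$ from the corpus, where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$. Let $X_i=\sum_j X_{ij}$ be the total number of times any word occurs in the context of word $i$. The co-occurrence probability is then

$P_{ij}=\frac{X_{ij}}{X_i}$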
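The definitions of $X_{ij}$, $X_i$ and $P_{ij}$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the function names `cooccurrence_counts` and `cooccurrence_probs`, and the window size of 2 are all made up here, and every co-occurrence is counted as 1 regardless of distance.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Return X where X[i][j] counts how often word j appears
    within `window` tokens of word i (symmetric window)."""
    counts = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[word][tokens[ctx_pos]] += 1.0
    return counts

def cooccurrence_probs(counts):
    """Return P where P[i][j] = X[i][j] / X[i], with X[i] = sum_j X[i][j]."""
    probs = {}
    for word, row in counts.items():
        total = sum(row.values())
        probs[word] = {ctx: c / total for ctx, c in row.items()}
    return probs

# Hypothetical toy corpus, for illustration only.
corpus = [
    "ice is solid water".split(),
    "steam is a gas".split(),
]
X = cooccurrence_counts(corpus, window=2)
P = cooccurrence_probs(X)
print(P["ice"])
```

Only the nonzero entries are ever stored, which matches the sparsity remark at the start of this section.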

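The authors observe an interesting phenomenon: the ratio $P_{ik}/P_{jk}$ reflects how a probe word $k$ relates to words $i$ and $j$. As Table 1 shows, with $i$ = ice and $j$ = steam:

1) When $k$ = solid, which is related to ice but not to steam, the ratio $P_{ik}/P_{jk}$ is large.

2) Conversely, when $k$ = gas, which is related to steam but not to ice, the ratio $P_{ik}/P_{jk}$ is small.

3) When $k$ = water (related to both ice and steam) or $k$ = fashion (related to neither), the ratio is close to 1.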
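The three cases can be checked against numbers. The sketch below uses the co-occurrence probabilities as transcribed from Table 1 of the GloVe paper; they are rounded there, so the ratios computed here differ slightly from the ratio row printed in the table, and the values should be verified against the original before being relied on.

```python
# Probabilities P(k | ice) and P(k | steam), transcribed from Table 1
# of the GloVe paper (rounded values; check against the original).
p_given_ice = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_given_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

# Ratio P_ik / P_jk with i = ice, j = steam.
ratios = {k: p_given_ice[k] / p_given_steam[k] for k in p_given_ice}

for k, r in ratios.items():
    print(f"P({k}|ice) / P({k}|steam) = {r:.3g}")
```

The ratio separates the discriminative probe words (solid, gas) from the non-discriminative ones (water, fashion) far more clearly than either raw probability does.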

Figure: idea of GloVe (Table 1 of the paper)

The above argument suggests that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves.

Expressed as a formula, this relationship is

$F(w_i, w_j, \widetilde w_k)=\frac{P_{ik}}{P_{jk}}$ (1)

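So how should $F$ be constructed?

Word vector spaces have a classic example:

king - queen = man - woman

This suggests that the word vector space should have a linear structure, so we can restrict $F$ accordingly: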

Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences. With this aim, we can restrict our consideration to those functions F that depend only on the difference of the two target words,

$F(w_i-w_j,\widetilde w_k)=\frac{P_{ik}}{P_{jk}}$ (2)

$F((w_i-w_j)^T\widetilde w_k)=\frac{P_{ik}}{P_{jk}}$ (3)

We also rewrite the right-hand side, the ratio of $P_{ik}$ to $P_{jk}$, in terms of $F$:

$F((w_i-w_j)^T\widetilde w_k)=\frac{F(w_i^T\widetilde w_k)}{F(w_j^T\widetilde w_k)}$ (4)

where the following relation holds:

$P_{ik}=\frac{X_{ik}}{X_i}=F(w_i^T\widetilde w_k)$ (5)

Eqn. (4) is solved by $F=\exp$. Taking the logarithm of both sides of Eqn. (5) then gives

$w_i^T\widetilde w_k=\log P_{ik}=\log X_{ik}-\log X_i$ (6)
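A quick numerical sanity check of Eqns. (4) to (6), assuming $F=\exp$ as in the paper. The three vectors below are arbitrary, hypothetical values chosen only for the check.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 3-dimensional vectors (arbitrary values).
w_i = [0.2, -0.1, 0.4]
w_j = [-0.3, 0.5, 0.1]
w_k_tilde = [0.7, 0.2, -0.6]

# Left side of Eqn. (4): F((w_i - w_j)^T w~_k) with F = exp.
diff = [a - b for a, b in zip(w_i, w_j)]
lhs = math.exp(dot(diff, w_k_tilde))

# Right side of Eqn. (4): F(w_i^T w~_k) / F(w_j^T w~_k).
rhs = math.exp(dot(w_i, w_k_tilde)) / math.exp(dot(w_j, w_k_tilde))

# Eqn. (6): the log of F(w_i^T w~_k) recovers the inner product itself.
recovered = math.log(math.exp(dot(w_i, w_k_tilde)))

print(lhs, rhs, recovered)
```

The exponential turns a difference of inner products into a ratio, which is exactly the property Eqn. (4) asks of $F$.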

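Here $\log X_i$ does not depend on $k$, so it can be absorbed into a bias $b_i$ for $w_i$. Adding a second bias $\widetilde b_k$ for $\widetilde w_k$ keeps the expression symmetric:

$w_i^T \widetilde w_k + b_i+\widetilde b_k=\log X_{ik}$ (7)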

The model is then recast as a weighted least squares regression, which gives the loss

$J=\sum_{i,j=1}^{V} f(X_{ij})(w_i^T\widetilde w_j+b_i+\widetilde b_j-\log X_{ij})^2$ (8)

where $f(X_{ij})$ is the weighting function

$f(x)=\begin{cases}(x/x_{max})^\alpha & \text{if } x<x_{max} \\ 1 & \text{otherwise}\end{cases}$ (9)

with $\alpha=3/4$ and $x_{max}=100$.
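Eqns. (8) and (9) translate almost directly into code. The following is a minimal sketch, not the reference implementation: the sparse dictionary layout `X[(i, j)] = count` and the function names are assumptions made for this example, and the sum runs over nonzero entries only, since $f(0)=0$ and $\log 0$ is undefined.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Eqn. (9): (x / x_max)^alpha below x_max, capped at 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Eqn. (8), summed over the nonzero co-occurrence counts in X."""
    total = 0.0
    for (i, j), x in X.items():
        err = dot(W[i], W_tilde[j]) + b[i] + b_tilde[j] - math.log(x)
        total += weight(x) * err * err
    return total

# Hypothetical toy data: 2 words, 2-dimensional vectors.
X = {(0, 1): 10.0, (1, 0): 10.0, (0, 0): 200.0}
W = [[0.1, 0.2], [0.3, -0.1]]
W_tilde = [[0.0, 0.1], [0.2, 0.2]]
b = [0.0, 0.0]
b_tilde = [0.0, 0.0]

loss = glove_loss(X, W, W_tilde, b, b_tilde)
print(loss)
```

With $\alpha=3/4$, a pair seen 10 times gets a weight of about 0.18, while any pair seen 100 times or more gets weight 1, so very frequent pairs cannot dominate the loss.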

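II. How GloVe relates to other models

The skip-gram model describes the probability $Q_{ij}$ that word $j$ appears in the context of word $i$ with a softmax, and training maximizes this probability over the observed (word, context) pairs:

$Q_{ij}=\frac{\exp(w_i^T\widetilde w_j)}{\sum_k \exp(w_i^T\widetilde w_k)}$ (10)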
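A small sketch of Eqn. (10) for a single word $i$. The vectors are hypothetical, and subtracting the maximum score before exponentiating is a standard numerical-stability trick added here, not something the formula itself requires.

```python
import math

def softmax_row(w_i, W_tilde):
    """Eqn. (10): Q_ij for a fixed word i over every context word j."""
    scores = [sum(a * b for a, b in zip(w_i, w_j)) for w_j in W_tilde]
    m = max(scores)  # shift scores for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # normalizer: one term per word in the vocabulary
    return [e / z for e in exps]

# Hypothetical word vector and 4 context vectors.
w_i = [0.5, -0.2]
W_tilde = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.3]]

Q_i = softmax_row(w_i, W_tilde)
print(Q_i)
```

The normalizer `z` needs one term per vocabulary word, which is what makes this distribution expensive for large vocabularies.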

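By maximum likelihood, maximizing $\prod Q_{ij}$ over the corpus is equivalent to minimizing the negative log-likelihood:

$J=-\sum_{i\in corpus,\ j\in context(i)} \log Q_{ij}$ (11)

Each pair $(i, j)$ occurs $X_{ij}$ times, so identical terms can be grouped:

$J=-\sum_{i=1}^{V} \sum_{j=1}^{V} X_{ij}\log Q_{ij}$ (12)

Using the earlier definitions $X_i=\sum_j X_{ij}$ and $P_{ij}=X_{ij}/X_i$:

$J=-\sum_{i=1}^V X_i\sum_{j=1}^{V} P_{ij}\log Q_{ij}=\sum_{i=1}^{V}X_iH(P_i, Q_i)$ (13)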
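The step from Eqn. (12) to Eqn. (13) can be verified numerically. In this sketch the count matrix `X` and the model distribution `Q` are hypothetical toy values (each row of `Q` sums to 1), and terms with $X_{ij}=0$ are skipped under the usual convention $0\log q=0$.

```python
import math

# Hypothetical co-occurrence counts and model distributions for V = 3.
X = [[4.0, 1.0, 0.0],
     [2.0, 2.0, 6.0],
     [0.0, 3.0, 3.0]]
Q = [[0.5, 0.3, 0.2],
     [0.2, 0.2, 0.6],
     [0.1, 0.4, 0.5]]

# Eqn. (12): J = -sum_ij X_ij log Q_ij
j12 = -sum(x * math.log(q)
           for x_row, q_row in zip(X, Q)
           for x, q in zip(x_row, q_row) if x > 0)

# Eqn. (13): J = sum_i X_i H(P_i, Q_i), with P_ij = X_ij / X_i
j13 = 0.0
for x_row, q_row in zip(X, Q):
    x_i = sum(x_row)
    p_row = [x / x_i for x in x_row]
    cross_entropy = -sum(p * math.log(q) for p, q in zip(p_row, q_row) if p > 0)
    j13 += x_i * cross_entropy

print(j12, j13)
```

Both expressions give the same value, so the skip-gram objective really is a sum of cross entropies weighted by $X_i$.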

Here $H(P_i, Q_i)$ is the cross entropy between the distributions $P_i$ and $Q_i$.

As a weighted sum of cross-entropy error, this objective bears some formal resemblance to the weighted least squares objective of Eqn. (8)

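At this point, the loss obtained from the softmax and maximum likelihood already looks formally similar to Eqn. (8). There is a problem, though: the distribution $Q$ is expensive to compute, especially when the vocabulary is large. The authors therefore propose a different error measure that avoids this cost. A simple choice is the squared error.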

To begin, cross entropy error is just one among many possible distance measures between probability distributions, and it has the unfortunate property that distributions with long tails are often modeled poorly with too much weight given to the unlikely events. Furthermore, for the measure to be bounded it requires that the model distribution Q be properly normalized. This presents a computational bottleneck owing to the sum over the whole vocabulary in Eqn. (10), and it would be desirable to consider a different distance measure that did not require this property of Q. A natural choice would be a least squares objective in which normalization factors in Q and P are discarded,

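$J=\sum_{i,j}X_i(\hat P_{ij}-\hat Q_{ij})^2=\sum_{i,j}X_i(X_{ij}-\exp(w_i^T\widetilde w_j))^2$ (14)

Here $\hat P_{ij}=X_{ij}$ and $\hat Q_{ij}=\exp(w_i^T\widetilde w_j)$ are no longer normalized probability distributions. This raises another problem: $X_{ij}$ can be very large, which makes optimization difficult. The authors therefore take the logarithm of both terms, and the objective becomes:

$J=\sum_{i,j}X_i(\log X_{ij}-w_i^T\widetilde w_j)^2$ (15)

Finally, the weighting factor $X_i$ is replaced by a more general weighting function: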

In fact, Mikolov et al. (2013a) observe that performance can be increased by filtering the data so as to reduce the effective value of the weighting factor for frequent words. With this in mind, we introduce a more general weighting function, which we are free to take to depend on the context word as well

$J=\sum_{i,j} f(X_{ij})(\log X_{ij}-w_i^T\widetilde w_j)^2$ (16)
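To make Eqn. (16) concrete, here is a minimal sketch that minimizes it with plain stochastic gradient descent on toy data. Several things are assumptions made for this example: the paper trains with AdaGrad and includes bias terms, whereas this sketch uses a fixed learning rate and no biases; the counts, the vector dimension, the initialization range and the helper names are all hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f from Eqn. (9)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def objective(X, W, W_tilde):
    """Eqn. (16), summed over nonzero counts only."""
    return sum(weight(x) * (math.log(x) - dot(W[i], W_tilde[j])) ** 2
               for (i, j), x in X.items())

def sgd_epoch(X, W, W_tilde, lr=0.02):
    """One pass of plain SGD over the nonzero entries (in place)."""
    for (i, j), x in X.items():
        # d/d(w_i . w~_j) of f(x) * (w_i . w~_j - log x)^2
        grad_scale = 2.0 * weight(x) * (dot(W[i], W_tilde[j]) - math.log(x))
        for d in range(len(W[i])):
            w_id, w_jd = W[i][d], W_tilde[j][d]
            W[i][d] -= lr * grad_scale * w_jd
            W_tilde[j][d] -= lr * grad_scale * w_id

# Hypothetical symmetric counts for a 3-word vocabulary.
X = {(0, 1): 20.0, (1, 0): 20.0, (0, 2): 3.0,
     (2, 0): 3.0, (1, 2): 50.0, (2, 1): 50.0}
rng = random.Random(0)
dim = 2
W = [[rng.uniform(0.1, 0.5) for _ in range(dim)] for _ in range(3)]
W_tilde = [[rng.uniform(0.1, 0.5) for _ in range(dim)] for _ in range(3)]

before = objective(X, W, W_tilde)
for _ in range(500):
    sgd_epoch(X, W, W_tilde)
after = objective(X, W, W_tilde)
print(before, after)
```

Each update touches only the two vectors of one nonzero pair, and no sum over the vocabulary is ever needed, which is the practical advantage over the softmax objective of Eqn. (10).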

III. Evaluation

The paper evaluates the word vectors on the following tasks.

1) Word analogy

The goal of this task is to answer questions such as “a is to b as c is to ___”.

The word analogy task consists of questions like, “a is to b as c is to ?”

This task tests whether the vector space has the expected linear structure.
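The paper answers such a question by finding the word $d$ whose vector is closest, by cosine similarity, to $w_b - w_a + w_c$. A minimal sketch follows; the 2-dimensional vectors are invented for illustration, and excluding the three question words from the candidates is a common evaluation convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to w_b - w_a + w_c."""
    target = [wb - wa + wc
              for wa, wb, wc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical vectors: axis 0 loosely "royalty", axis 1 loosely "gender".
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}

answer = solve_analogy(vectors, "man", "woman", "king")
print(answer)
```

With real embeddings the candidate set is the whole vocabulary, and accuracy is the fraction of questions for which the top-ranked word is the expected one.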

2）Word similarity

3）Named Entity Recognition

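The authors also analyze the model further.

1) Effect of vector dimension and window size

The authors compare results for different vector dimensions and window sizes. Two kinds of window are used: a symmetric window, which extends to both the left and the right of the target word, and an asymmetric window, which extends only to the left.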

A context window that extends to the left and right of a target word will be called symmetric, and one which extends only to the left will be called asymmetric.
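The difference between the two window types is easy to see in code. This sketch is an illustration only: the function name `window_counts`, the example sentence and the window size are assumptions, and each co-occurrence is counted as 1 regardless of distance.

```python
from collections import defaultdict

def window_counts(tokens, window, symmetric=True):
    """Count (word, context) pairs with a symmetric or left-only window."""
    counts = defaultdict(float)
    for pos, word in enumerate(tokens):
        # Left context: used by both window types.
        for ctx_pos in range(max(0, pos - window), pos):
            counts[(word, tokens[ctx_pos])] += 1.0
        # Right context: used only by the symmetric window.
        if symmetric:
            for ctx_pos in range(pos + 1, min(len(tokens), pos + window + 1)):
                counts[(word, tokens[ctx_pos])] += 1.0
    return counts

tokens = "the cat sat on the mat".split()
sym = window_counts(tokens, window=2, symmetric=True)
asym = window_counts(tokens, window=2, symmetric=False)

print(sum(sym.values()), sum(asym.values()))
```

With the asymmetric window, "cat" sees "the" on its left, but the first "the" never sees "cat", so the resulting count matrix is no longer symmetric.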

Figure: vector dimension and window size

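2) GloVe vs. word2vec

Figure: GloVe vs. word2vec

This figure deserves a closer look. First, the axes: the horizontal axis has two meanings. For GloVe it is the number of training iterations, and for word2vec it is the number of negative samples. The figure shows that the number of negative samples should not be too large: about 10 is enough, and more than that hurts the model. Interestingly, GloVe outperforms word2vec on the analogy task.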

The paper provides a link to the code:

We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/.
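The released vectors are distributed as plain text, one word per line followed by its space-separated vector components. A minimal loader for that format might look like the sketch below; the function name `load_vectors` is hypothetical, and the two sample lines are made-up stand-ins for a real file so that the example stays self-contained.

```python
import io

def load_vectors(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

# In-memory sample in the same layout; a real file would be opened with
# open(path, encoding="utf-8") and passed in the same way.
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
vectors = load_vectors(sample)
print(vectors["cat"])
```

Because the function accepts any iterable of lines, the same code works for a file object and for the in-memory sample.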

UniformML-paper 4 Glove 《GloVe Global Vectors for Word Representation》
