UniformML-paper 4 GloVe: "GloVe: Global Vectors for Word Representation"


I. Introduction to the GloVe model

GloVe is built on the word-word co-occurrence matrix. Because this matrix is very sparse, the model trains only on its nonzero elements.

Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus.




The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods.


While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

1) Matrix factorization methods


In LSA, the matrices are of “term-document” type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus.

2) Shallow window-based methods

In the skip-gram and ivLBL models, the objective is to predict a word’s context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context.


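First, build a word-word co-occurrence matrix X from the corpus, where X_{ij} is the number of times word j occurs in the context of word i. Let X_i=\sum_j X_{ij} be the number of times any word appears in the context of word i, and let P_{ij}=P(j|i)=X_{ij}/X_i be the probability that word j appears in the context of word i.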
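The construction of X, X_i and P_{ij} can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `build_cooccurrence` and `row_totals`, the toy sentence, and the window size of 2 are all assumptions, and every co-occurrence is counted as 1 (the paper instead down-weights pairs that are d words apart by 1/d).

```python
from collections import defaultdict

def build_cooccurrence(tokens, window=2):
    """Count how often each context word appears within `window` positions of each target word."""
    counts = defaultdict(float)
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                # Plain counting: each co-occurrence adds 1 (assumption, see lead-in).
                counts[(word, tokens[ctx_pos])] += 1.0
    return counts

def row_totals(counts):
    """X_i = sum_j X_ij for every target word i."""
    totals = defaultdict(float)
    for (word, _ctx), value in counts.items():
        totals[word] += value
    return totals

# Hypothetical toy corpus.
tokens = "ice is solid and steam is gas".split()
X = build_cooccurrence(tokens, window=2)
X_row = row_totals(X)

# P_ij = X_ij / X_i: probability that "is" appears in the context of "ice".
P_is_given_ice = X[("ice", "is")] / X_row["ice"]
```

Only pairs that actually co-occur are stored, which mirrors the point that training touches the nonzero entries only.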


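The authors observe an interesting phenomenon: the ratio P_{ik}/P_{jk} reflects how the probe word k relates to words i and j. Taking i=ice and j=steam, as shown in Table 1 of the paper:

1) when k=solid, which is related to ice but not to steam, the ratio is much larger than 1;

2) when k=gas, which is related to steam but not to ice, the ratio is much smaller than 1;

3) when k=water or k=fashion, which is related to both ice and steam or to neither, the ratio is close to 1.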
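To make the behaviour of the ratio concrete, the sketch below computes P_{ik}/P_{jk} for i=ice and j=steam. The counts are invented for illustration and are not the values from Table 1; `cond_prob` and `prob_ratio` are hypothetical helper names.

```python
# Hypothetical co-occurrence counts X[target][context]; NOT taken from the paper.
X = {
    "ice":   {"solid": 80.0, "gas": 4.0,  "water": 300.0, "fashion": 2.0},
    "steam": {"solid": 5.0,  "gas": 90.0, "water": 290.0, "fashion": 2.0},
}

def cond_prob(X, target, context):
    """P(context | target) = X_ik / X_i, with X_i the row total."""
    return X[target][context] / sum(X[target].values())

def prob_ratio(X, i, j, k):
    """Ratio P_ik / P_jk for probe word k."""
    return cond_prob(X, i, k) / cond_prob(X, j, k)

ratios = {k: prob_ratio(X, "ice", "steam", k) for k in X["ice"]}
for k, r in ratios.items():
    print(f"k={k:8s} P(k|ice)/P(k|steam) = {r:.3f}")
```

With these toy counts the ratio is far above 1 for solid, far below 1 for gas, and near 1 for water and fashion, which is the pattern the ratio is meant to expose.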

Idea of GloVe

The above argument suggests that the appropriate starting point for word vector learning should be with ratios of co-occurrence probabilities rather than the probabilities themselves.


F(w_i, w_j, \widetilde w_k)=\frac{P_{ik}}{P_{jk}}        (1)





Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences. With this aim, we can restrict our consideration to those functions F that depend only on the difference of the two target words,

F(w_i-w_j,\widetilde w_k)=\frac{P_{ik}}{P_{jk}}      (2)

F((w_i-w_j)^T\widetilde w_k)=\frac{P_{ik}}{P_{jk}}     (3)


F((w_i-w_j)^T\widetilde w_k)=\frac{F(w_i^T\widetilde w_k)}{F(w_j^T\widetilde w_k)}     (4)


P_{ik}=\frac{X_{ik}}{X_i}=F(w_i^T\widetilde w_k)        (5)


w_i^T\widetilde w_k=log(P_{ik})=logX_{ik}-logX_i      (6)


w_i^T \widetilde w_k + b_i+\widetilde b_k=log(X_{ik})      (7)
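The steps between Eqs. (4) and (7) are terse, so here is a short worked version following the paper's argument. It is a sketch of the reasoning, using only the symbols already defined above.

```latex
% Eq. (4) asks F to turn a difference of dot products into a ratio.
% The exponential does exactly that, so take F = \exp:
\exp\big((w_i-w_j)^T\widetilde w_k\big)
  =\frac{\exp(w_i^T\widetilde w_k)}{\exp(w_j^T\widetilde w_k)}

% With F = \exp, Eq. (5) reads
\exp(w_i^T\widetilde w_k)=P_{ik}=\frac{X_{ik}}{X_i}

% Taking the logarithm of both sides gives Eq. (6):
w_i^T\widetilde w_k=\log X_{ik}-\log X_i

% \log X_i does not depend on k, so it can be absorbed into a bias b_i.
% Adding a second bias \widetilde b_k restores the symmetry between
% word and context vectors, which gives Eq. (7):
w_i^T\widetilde w_k+b_i+\widetilde b_k=\log X_{ik}
```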

The model is then recast as a weighted least squares regression problem, which gives the loss function:

J=\sum_{i,j=1}^{V} f(X_{ij})(w_i^T\widetilde w_j+b_i+\widetilde b_j-logX_{ij})^2       (8)


f(x)=\begin{cases}(x/x_{max})^\alpha & \text{if } x<x_{max} \\ 1 & \text{otherwise}\end{cases}    (9)

where \alpha=3/4 and x_{max}=100.
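The weighting function of Eq. (9) and the loss of Eq. (8) can be sketched as follows. This is an illustrative toy, not the released implementation: the counts, the vector dimension, the learning rate and all function names are assumptions, and the optimizer is plain SGD rather than the AdaGrad used in the paper.

```python
import math
import random

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f of Eq. (9)."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Eq. (8), summed over nonzero entries only (f(0) = 0 removes the rest)."""
    total = 0.0
    for (i, j), x in X.items():
        err = dot(W[i], W_ctx[j]) + b[i] + b_ctx[j] - math.log(x)
        total += weight(x) * err ** 2
    return total

def sgd_epoch(X, W, W_ctx, b, b_ctx, lr=0.05):
    """One pass of plain SGD over the nonzero entries (assumption: the paper uses AdaGrad)."""
    for (i, j), x in X.items():
        err = dot(W[i], W_ctx[j]) + b[i] + b_ctx[j] - math.log(x)
        g = 2.0 * weight(x) * err  # derivative of f(x) * err^2 with respect to err
        for d in range(len(W[i])):
            wi, wj = W[i][d], W_ctx[j][d]  # keep old values so both updates use them
            W[i][d] -= lr * g * wj
            W_ctx[j][d] -= lr * g * wi
        b[i] -= lr * g
        b_ctx[j] -= lr * g

# Hypothetical toy counts X[(i, j)] for a 3-word vocabulary; zero entries are simply absent.
X = {(0, 1): 12.0, (1, 0): 12.0, (0, 2): 3.0, (2, 0): 3.0, (1, 2): 150.0, (2, 1): 150.0}
rng = random.Random(0)
V, dim = 3, 2
W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
W_ctx = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
b, b_ctx = [0.0] * V, [0.0] * V

loss_before = glove_loss(X, W, W_ctx, b, b_ctx)
for _ in range(200):
    sgd_epoch(X, W, W_ctx, b, b_ctx)
loss_after = glove_loss(X, W, W_ctx, b, b_ctx)
print(loss_before, loss_after)
```

Note how `weight` caps the influence of very frequent pairs at 1 and shrinks the influence of rare pairs, so neither extreme dominates the fit.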

II. Relationship between GloVe and other models

The skip-gram model maximizes the Q_{ij} defined below. Here Q_{ij} is a softmax: given word i, it is the model's probability that word j appears in the context of i.

Q_{ij}=\frac{exp(w_i^T\widetilde w_j)}{\sum_k exp(w_i^T\widetilde w_k)}      (10)
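A direct implementation of Eq. (10) makes the cost of the normalizer visible: every evaluation sums over the whole vocabulary. The vectors below are made up, and `softmax_row` is a hypothetical helper name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_row(w_i, W_ctx):
    """Return Q_i = [Q_i1, ..., Q_iV] for one target vector w_i, as in Eq. (10)."""
    scores = [dot(w_i, w_k) for w_k in W_ctx]
    shift = max(scores)  # subtracting the max avoids overflow; it cancels in the ratio
    exps = [math.exp(s - shift) for s in scores]
    Z = sum(exps)        # normalizer: one term per vocabulary word
    return [e / Z for e in exps]

# Hypothetical 2-dimensional vectors for a 3-word vocabulary.
w_i = [1.0, 0.0]
W_ctx = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
Q_i = softmax_row(w_i, W_ctx)
print(Q_i)
```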


\max \prod Q_{ij} \iff \min J,\quad J=-\sum_{i\in corpus,\ j\in context(i)} log(Q_{ij})       (11)

J=-\sum_{i=1}^{V} \sum_{j=1}^{V} X_{ij}logQ_{ij}      (12)

Recall that X_i=\sum_j X_{ij} and P_{ij}=X_{ij}/X_i, so X_{ij}=X_iP_{ij} and Eq. (12) becomes

J=-\sum_{i=1}^V X_i\sum_{j=1}^{V} P_{ij}log(Q_{ij})=\sum_{i=1}^{V}X_iH(P_i, Q_i)     (13)

Here H(P_i, Q_i) is the cross entropy of the distributions P_i and Q_i.
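The equivalence of Eqs. (12) and (13) is easy to check numerically. The counts and model probabilities below are invented for the check, and the function names are hypothetical.

```python
import math

# Hypothetical counts X[i][j] and model probabilities Q[i][j] (each Q row sums to 1).
X = [[0.0, 6.0, 2.0],
     [6.0, 0.0, 4.0],
     [2.0, 4.0, 0.0]]
Q = [[0.2, 0.5, 0.3],
     [0.6, 0.1, 0.3],
     [0.25, 0.5, 0.25]]

def objective_eq12(X, Q):
    """J = -sum_ij X_ij log Q_ij, skipping zero counts."""
    return -sum(x * math.log(q)
                for X_i, Q_i in zip(X, Q)
                for x, q in zip(X_i, Q_i) if x > 0)

def cross_entropy(P_i, Q_i):
    """H(P_i, Q_i) = -sum_j P_ij log Q_ij."""
    return -sum(p * math.log(q) for p, q in zip(P_i, Q_i) if p > 0)

def objective_eq13(X, Q):
    """J = sum_i X_i H(P_i, Q_i), with X_i the row total and P_ij = X_ij / X_i."""
    total = 0.0
    for X_i, Q_i in zip(X, Q):
        row_total = sum(X_i)
        P_i = [x / row_total for x in X_i]
        total += row_total * cross_entropy(P_i, Q_i)
    return total

j12 = objective_eq12(X, Q)
j13 = objective_eq13(X, Q)
print(j12, j13)  # the two forms agree up to floating-point error
```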

As a weighted sum of cross-entropy error, this objective bears some formal resemblance to the weighted least squares objective of Eqn. (8).


To begin, cross entropy error is just one among many possible distance measures between probability distributions, and it has the unfortunate property that distributions with long tails are often modeled poorly with too much weight given to the unlikely events. Furthermore, for the measure to be bounded it requires that the model distribution Q be properly normalized. This presents a computational bottleneck owing to the sum over the whole vocabulary in Eqn. (10), and it would be desirable to consider a different distance measure that did not require this property of Q. A natural choice would be a least squares objective in which normalization factors in Q and P are discarded,

J=\sum_{i,j}X_i(\hat P_{ij}-\hat Q_{ij})^2=\sum_{i,j}X_i(X_{ij}-exp(w_i^T\widetilde w_j))^2     (14)


J=\sum_{i,j}X_i(logX_{ij}-w_i^T\widetilde w_j)^2      (15)


In fact, Mikolov et al. (2013a) observe that performance can be increased by filtering the data so as to reduce the effective value of the weighting factor for frequent words. With this in mind, we introduce a more general weighting function, which we are free to take to depend on the context word as well.

J=\sum_{i,j} f(X_{ij})(logX_{ij}-w_i^T\widetilde w_j)^2    (16)



1) Word analogy

The goal of this task is to answer questions of the form “a is to b as c is to ___”.

The word analogy task consists of questions like, “a is to b as c is to ?”
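In the paper, such a question is answered by finding the word d whose vector is closest to w_b - w_a + w_c under cosine similarity. The sketch below shows that rule on made-up 2-dimensional vectors. The vectors and the function name `solve_analogy` are hypothetical, and excluding the three question words from the candidates is a common convention assumed here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(vectors, a, b, c):
    """Return the word d (excluding a, b, c) whose vector is closest to w_b - w_a + w_c."""
    target = [wb - wa + wc for wa, wb, wc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical 2-d vectors: axis 0 roughly "royalty", axis 1 roughly "gender".
vectors = {
    "man": [0.1, 1.0], "woman": [0.1, -1.0],
    "king": [1.0, 1.0], "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}

# "man is to king as woman is to ?"
answer = solve_analogy(vectors, "man", "king", "woman")
print(answer)
```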


2) Word similarity

3) Named entity recognition


1) Effect of vector dimension and window size

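The authors also compare results across vector dimensions and window sizes. Two kinds of window are used: a symmetric window, which extends to both the left and the right of the target word, and an asymmetric window, which extends only to the left.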

A context window that extends to the left and right of a target word will be called symmetric, and one which extends only to the left will be called asymmetric.
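The difference between the two window types is just which positions count as context. A minimal sketch, where `context_positions` is a hypothetical helper and the sentence is a toy example:

```python
def context_positions(pos, length, window, symmetric=True):
    """Indices of context words for the target at `pos` in a sentence of `length` tokens."""
    left = list(range(max(0, pos - window), pos))
    if not symmetric:
        return left  # asymmetric: only words to the left of the target
    right = list(range(pos + 1, min(length, pos + window + 1)))
    return left + right

tokens = "the cat sat on the mat".split()
target = 2  # "sat"
sym = [tokens[p] for p in context_positions(target, len(tokens), window=2, symmetric=True)]
asym = [tokens[p] for p in context_positions(target, len(tokens), window=2, symmetric=False)]
print(sym, asym)
```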

Figure: vector dimension and window size

2) GloVe vs. word2vec

Figure: GloVe vs. word2vec

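This figure deserves a closer look. First, the horizontal axis has two meanings: for GloVe it is the number of training iterations, and for word2vec it is the number of negative samples. The word2vec curves show that the number of negative samples should not be too large: around 10 is enough, and more than that hurts performance. Interestingly, GloVe outperforms word2vec on the analogy task.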


We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/.

