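Introduction to GloVe
GloVe is a method from Stanford University for generating word vectors. It combines global corpus statistics with local context-window statistics to learn vector representations of words. Official homepage: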
[http://nlp.stanford.edu/projects/glove/]
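Installing and using GloVe
This section covers the C version of GloVe on Ubuntu Linux.
Downloading the code
Download GloVe-1.2.zip from the official site [https://nlp.stanford.edu/projects/glove/].
Overview of the code files:
Enter the glove directory and read README.txt first. It explains that the package consists of four programs, run in this order: vocab_count, cooccur, shuffle, and glove.
1. vocab_count: counts word frequencies in the raw corpus (writes vocab.txt, one "word frequency" pair per line).
2. cooccur: counts word-word co-occurrences, apparently for any two words inside a context window, much like word2vec (writes the binary file cooccurrence.bin).
3. shuffle: shuffles the co-occurrence records from step 2 (writes another binary file, cooccurrence.shuf.bin).
4. glove: trains the GloVe model using the files from steps 1 and 3, and finally writes vectors.txt and vectors.bin (the former is plain text and is the one to study; the latter is binary).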
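To make step 2 more concrete, here is a minimal awk sketch of windowed co-occurrence counting. It is not the real cooccur program: it assumes a symmetric window and weights each pair by 1/distance, which is an assumption about cooccur's defaults rather than something the README summary above states. The window size `w=2` and the toy sentence are made up for illustration.

```shell
# Toy illustration of windowed co-occurrence counting (not the real cooccur binary).
# Assumptions: symmetric window, each pair weighted by 1/distance.
pairs=$(printf 'a b c\n' | awk -v w=2 '
{
    for (i = 1; i <= NF; i++)
        for (j = i + 1; j <= NF && j - i <= w; j++) {
            weight = 1 / (j - i)
            count[$i " " $j] += weight   # right-hand context
            count[$j " " $i] += weight   # left-hand context (symmetric)
        }
}
END { for (k in count) printf "%s %.2f\n", k, count[k] }' | LC_ALL=C sort)
printf '%s\n' "$pairs"
```

Adjacent words get weight 1.00 and words two positions apart get 0.50, so closer neighbours contribute more to the co-occurrence table.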
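Training word vectors
1. Unzip the download and upload the code. Extract the archive to the E: drive first, then upload it:
put -r E:\GloVe-1.2
2. Change into the GloVe-1.2 directory and compile the programs:
make
If compilation succeeds, a build directory is created.
3. Run the script:
./demo.sh
On success this produces cooccurrence.bin, cooccurrence.shuf.bin, vocab.txt, vectors.txt, and vectors.bin. Open vectors.txt to see the trained word vectors.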
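The text output stores one word per line, followed by its vector components separated by spaces. The sketch below shows one way to check the vocabulary size and dimensionality and to pull out a single word's vector. It runs on a tiny stand-in file whose numbers are invented for illustration; point `vectors` at your real vectors.txt instead.

```shell
# Toy stand-in for vectors.txt; the words and numbers are made up for illustration.
vectors=$(mktemp)
printf 'the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n' > "$vectors"

# Number of words = number of lines; dimension = fields on a line minus the word.
words=$(wc -l < "$vectors" | tr -d ' ')
dim=$(awk 'NR == 1 { print NF - 1 }' "$vectors")
echo "words=$words dim=$dim"

# Print the vector for one word (drop the first field, then the leading space).
cat_vec=$(awk '$1 == "cat" { $1 = ""; sub(/^ /, ""); print }' "$vectors")
echo "cat -> $cat_vec"

rm -f "$vectors"
```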
(1) If the script fails because of a permissions problem, make it executable:
chmod +x demo.sh
(2) If the script instead reports an error in the block below, demo.sh needs to be modified:
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
if [[ $? -eq 0 ]]
then
    $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
    if [[ $? -eq 0 ]]
    then
        $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
        if [[ $? -eq 0 ]]
        then
            $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
            if [[ $? -eq 0 ]]
            then
                if [ "$1" = 'matlab' ]; then
                    matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
                elif [ "$1" = 'octave' ]; then
                    octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
                else
                    python eval/python/evaluate.py
                fi
            fi
        fi
    fi
fi
Replace the code above with the following:
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
    if [ "$1" = 'matlab' ]; then
        matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
    elif [ "$1" = 'octave' ]; then
        octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
    else
        echo "$ python eval/python/evaluate.py"
        python eval/python/evaluate.py
    fi
fi
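As shown, the rewritten script no longer checks each step's exit status, so a failed step would not stop the later ones. One possible way to keep the stop-on-failure behaviour without `[[ ]]` is to chain the steps with `&&`. The sketch below is only an illustration: `step_ok` and `step_fail` are hypothetical stand-ins for the real vocab_count, cooccur, shuffle, and glove commands.

```shell
# Hypothetical stand-ins for the four pipeline commands.
step_ok()   { echo "ran: $1"; }
step_fail() { echo "failed: $1" >&2; return 1; }

# Each step runs only if the previous one returned exit status 0.
log=$(step_ok vocab_count && step_fail cooccur && step_ok shuffle && step_ok glove
      echo "exit=$?")
printf '%s\n' "$log"
```

Here the simulated cooccur failure stops the chain, so shuffle and glove never run and the chain's exit status is non-zero.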
The corpus used by this script is text8, downloaded from [http://mattmahoney.net/dc/text8.zip]; the word vectors are trained on it.
To train on your own corpus, modify the following part of demo.sh:
make
if [ ! -e text8 ]; then
    if hash wget 2>/dev/null; then
        wget http://mattmahoney.net/dc/text8.zip
    else
        curl -O http://mattmahoney.net/dc/text8.zip
    fi
    unzip text8.zip
    rm text8.zip
fi
# GloVe parameters follow
CORPUS=text8  # path to the corpus file, already tokenized
VOCAB_FILE=vocab.txt  # output vocabulary file
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5  # words seen fewer times than this are dropped
VECTOR_SIZE=50  # word vector dimension
MAX_ITER=15
WINDOW_SIZE=15  # context window size
BINARY=2  # 2 = save both text and binary output
NUM_THREADS=8
X_MAX=10
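Before training on your own corpus it can help to see how VOCAB_MIN_COUNT prunes the vocabulary. The sketch below approximates the vocab.txt format ("word frequency" per line) with standard tools rather than the real vocab_count binary; the toy corpus and the threshold of 2 are made up for illustration, and the tie-breaking order is an assumption.

```shell
# Toy approximation of vocab_count with a minimum-count filter (not the real binary).
VOCAB_MIN_COUNT=2
corpus=$(mktemp)
printf 'a b a c a b d\n' > "$corpus"

# Split on whitespace, count each word, keep words at or above the threshold,
# then order by frequency (descending) and by word for ties.
kept=$(tr -s '[:space:]' '\n' < "$corpus" | grep -v '^$' | LC_ALL=C sort | uniq -c |
       awk -v min="$VOCAB_MIN_COUNT" '$1 >= min { print $2, $1 }' |
       LC_ALL=C sort -k2,2nr -k1,1)
printf '%s\n' "$kept"

rm -f "$corpus"
```

With this threshold, "c" and "d" (one occurrence each) are dropped, which is why a small corpus combined with a high VOCAB_MIN_COUNT can leave very few words to train on.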