With the text data preprocessed in the previous chapter, we can now start training a text model on it. This chapter uses the word2vec model for that training.
word2vec is an NLP (Natural Language Processing) tool released by Google in 2013. Its key idea is to represent words as vectors, so that the relationships between words can be measured quantitatively and further mined.
Representing words as vectors was not invented by word2vec. The earliest word representations were very long: their dimensionality equaled the size of the whole vocabulary, and for each word in the vocabulary the corresponding position was set to 1. For example, as shown in the figure below, given a vocabulary of five words in which the word "Queen" has index 2, its word vector is (0, 1, 0, 0, 0); likewise, the vector for "Woman" is (0, 0, 0, 1, 0). This way of encoding word vectors is usually called 1-of-N representation, or one-hot encoding.
Representing word vectors with one-hot encoding is very simple, but it also has many problems. The biggest one is that vocabularies are usually very large, often reaching millions of words, so representing every word with a vector of millions of dimensions is essentially infeasible. Moreover, such a vector has a 1 in only one position and 0 everywhere else, which is an extremely inefficient representation, and using it in a convolutional neural network makes the network hard to converge.
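As a small illustration (the five-word vocabulary and its ordering are assumptions chosen to match the figure above), one-hot vectors can be built as follows:

# Minimal sketch of one-hot encoding; the vocabulary is assumed for illustration.
vocabulary = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word, vocabulary):
    # The vector is as long as the vocabulary and contains a single 1.
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("Queen", vocabulary))   # [0, 1, 0, 0, 0]
print(one_hot("Woman", vocabulary))   # [0, 0, 0, 1, 0]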
word2vec is one way to solve the problems of one-hot encoding. The idea is to learn, through training, a mapping from every word to a much shorter word vector. All of these word vectors together form a vector space, and ordinary statistical methods can then be used to study the relationships between words.
word2vec training models
Word2vec has two training models: CBOW (Continuous Bag-of-Words) and Skip-gram.
The CBOW model
CBOW, the continuous bag-of-words model, is a three-layer neural network. Its characteristic is that the input is the known context and the output is a prediction of the current word, as shown on the left of the figure below.
The Skip-gram model
Skip-gram is exactly the opposite of CBOW: it predicts the context words from the current word, as shown on the right of the figure above. A small sketch of the two prediction directions follows below.
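To make the difference concrete, the following sketch (an illustration only, not part of gensim; the sentence and window size are assumptions) lists the training pairs the two models would build from a short sentence with a context window of 1:

sentence = ["deep", "learning", "with", "jax"]
window = 1

for i, target in enumerate(sentence):
    # The context consists of the neighbouring words inside the window.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    # CBOW: predict the current word from its context.
    print(f"CBOW: context {context} -> target {target!r}")
    # Skip-gram: predict each context word from the current word.
    for c in context:
        print(f"Skip-gram: target {target!r} -> context {c!r}")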
The finer details of word2vec's training models and training procedures are not discussed here; the focus is on how to train a word2vec model whose vectors can then be obtained and used.
Training on text data with the gensim package
There are many ways to train word vectors. The simplest is to train on the text data with the gensim package from the Python ecosystem.
Training the word2vec model
The first step is to train the word model. The code is as follows:
import sys
import gensim

# Reuse the data preparation module from the previous chapter.
sys.path.append("../52/")
import AgNewsCsvReader


class LossLogger(gensim.models.callbacks.CallbackAny2Vec):
    """
    Output loss at each epoch
    """
    def __init__(self):
        self.epoch = 1
        self.losses = []

    def on_train_begin(self, model):
        print("Train started")

    def on_epoch_begin(self, model):
        print(f"Epoch {self.epoch}", end = '\t')

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        self.losses.append(loss)
        print(f"Loss: {loss}")
        self.epoch += 1

    def on_train_end(self, model):
        print("Train ended")


def train(strings, name, callback = LossLogger()):
    model = gensim.models.word2vec.Word2Vec(strings, vector_size = 64, min_count = 0, window = 5, callbacks = [callback])
    model.save(name)
    return model


def main():
    labels, titles, descriptions = AgNewsCsvReader.setup()
    train(titles, "/tmp/CorpusWord2Vec.bin")


if __name__ == "__main__":
    main()
The code first imports the dataset preparation module from the previous chapter and then imports the gensim package. It also defines a callback class inheriting from gensim.models.callbacks.CallbackAny2Vec, which is used to report the training progress.
The parameters of gensim.models.word2vec.Word2Vec are explained as follows:
def __init__(
    self, sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5, min_count=5,
    max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
    sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, epochs=5, null_word=0,
    trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
    comment=None, max_final_vocab=None, shrink_windows=True,
)
- sentences: the input data (an iterable of tokenized sentences).
- workers: the number of worker threads used to train in parallel.
- vector_size: the dimensionality of the word vectors.
- min_count: the minimum word frequency; less frequent words are ignored.
- window: the size of the context window.
- sample: the threshold used for downsampling high-frequency words.
- epochs: the number of training iterations over the corpus.
Unless there are special requirements, the default settings are fine; a small sketch of setting a few of these parameters explicitly follows below.
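For example, if a Skip-gram model is wanted instead of the default CBOW, or if the LossLogger callback above should report a non-zero loss, the corresponding parameters can be set explicitly. This is only a minimal sketch; the parameter values are assumptions rather than recommendations. (The training call in the listing above leaves compute_loss at its default of False, which is why the log later in this chapter prints a loss of 0.0.)

model = gensim.models.word2vec.Word2Vec(
    titles,                # tokenized sentences, e.g. the AG News titles
    vector_size = 64,      # dimensionality of the word vectors
    window = 5,            # context window size
    min_count = 0,         # keep every word, even rare ones
    sg = 1,                # 1 = Skip-gram, 0 = CBOW (the default)
    compute_loss = True,   # required for get_latest_training_loss()
    callbacks = [LossLogger()],
)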
The model.save() function stores the trained model for later use.
Retraining the word2vec model
After training, the model can be stored; but as more training text becomes available, it needs to be trained again to improve it. gensim also provides a way to continue training an existing model. The code is as follows:
def retrain(strings, name, callback = LossLogger()):
    model = gensim.models.word2vec.Word2Vec.load(name)
    model.train(strings, epochs = model.epochs, total_examples = model.corpus_count, callbacks = [callback])
    return model
Word2Vec provides a function for loading a stored model, after which the train function continues training the model on top of what it has already learned. In the initial training run, the AG News titles are used as the training text; in later runs, the description field can be used as well, so that together they form a larger training corpus, as sketched below.
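As a usage sketch of that idea (assuming, as in the full listing below, that AgNewsCsvReader.setup() returns the labels, titles and descriptions), the model is first trained on the titles and then updated with the descriptions. Unlike the retrain function above, the sketch also calls build_vocab with update=True, which gensim requires when the new text contains words that are not yet in the vocabulary:

labels, titles, descriptions = AgNewsCsvReader.setup()
name = "/tmp/CorpusWord2Vec.bin"

train(titles, name)                                  # initial training on the titles
model = gensim.models.word2vec.Word2Vec.load(name)   # reload the stored model
model.build_vocab(descriptions, update = True)       # add new words from the descriptions
model.train(descriptions, epochs = model.epochs, total_examples = model.corpus_count)
model.save(name)                                     # store the updated model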
Using the word2vec model
Once training is finished, the trained model can be used to vectorize words. The code is as follows:
def vectorize(model, string):
    string = AgNewsCsvReader.purify(string)
    print(model.wv[string])
Here, string is the text to be converted; it also needs to be cleaned with purify first, after which the trained model converts it into vectors.
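Beyond looking up raw vectors, the trained model can also be used to measure the relationships between words mentioned at the start of this chapter. A minimal sketch using gensim's standard most_similar query (the query word "learning" is an arbitrary example and must occur in the training corpus):

def similar_words(model, word, topn = 5):
    # most_similar returns a list of (word, cosine similarity) pairs.
    if word in model.wv:
        return model.wv.most_similar(word, topn = topn)
    return []

print(similar_words(model, "learning"))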
The complete code is as follows:
import sys
import gensim

# Reuse the data preparation module from the previous chapter.
sys.path.append("../52/")
import AgNewsCsvReader


class LossLogger(gensim.models.callbacks.CallbackAny2Vec):
    """
    Output loss at each epoch
    """
    def __init__(self):
        self.epoch = 1
        self.losses = []

    def on_train_begin(self, model):
        print("Train started")

    def on_epoch_begin(self, model):
        print(f"Epoch {self.epoch}", end = '\t')

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        self.losses.append(loss)
        print(f"Loss: {loss}")
        self.epoch += 1

    def on_train_end(self, model):
        print("Train ended")


def train(strings, name, callback = LossLogger()):
    model = gensim.models.word2vec.Word2Vec(strings, vector_size = 64, min_count = 0, window = 5, callbacks = [callback])
    model.save(name)
    return model


def retrain(strings, name, callback = LossLogger()):
    model = gensim.models.word2vec.Word2Vec.load(name)
    model.train(strings, epochs = model.epochs, total_examples = model.corpus_count, callbacks = [callback])
    return model


def vectorize(model, string):
    string = AgNewsCsvReader.purify(string)
    print(model.wv[string])


def main():
    labels, titles, descriptions = AgNewsCsvReader.setup()
    name = "/tmp/CorpusWord2Vec.bin"
    callback = LossLogger()
    train(titles, name, callback)
    model = retrain(titles, name, callback)
    text = "Deep Learning with JAX"
    vectorize(model, text)


if __name__ == "__main__":
    main()
The printed output of a run is as follows:
[nltk_data] Downloading package stopwords to /tmp/...
[nltk_data] Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
Train started
Epoch 1 Loss: 0.0
Epoch 2 Loss: 0.0
Epoch 3 Loss: 0.0
Epoch 4 Loss: 0.0
Epoch 5 Loss: 0.0
Train ended
[[ 0.36099744 -1.1440252 0.4847822 0.7591337 0.85798764 -0.25476497
0.30977422 -0.21319509 1.5249188 -0.72062206 -0.4467937 0.14021704
-2.6198063 0.661755 -0.4327411 0.39416602 0.8436706 -0.21277171
-0.4151219 -0.54621285 -0.35042974 0.15172826 -1.0609459 -0.74437237
0.85207653 -0.60725677 0.33218148 0.6240094 0.7127134 -0.0417395
0.34569505 1.234781 0.81285703 -0.7366463 -0.7687304 0.06735715
-0.23438244 -0.12991829 -1.8167065 2.5446103 -0.00938168 -0.37465686
0.00299378 -0.20981383 1.6535916 -0.1554157 -1.1753728 1.5126673
0.14153121 1.6263632 -0.622682 0.38650435 0.8323702 -0.47859466
-1.1491399 1.2218014 0.9397419 -0.42536902 -0.05659487 0.5516389
-0.9549699 0.87242097 0.18625498 0.89083576]
[ 0.33479187 0.04653507 0.01926478 0.3875753 0.32637012 -0.32717916
-0.30709228 -0.9014067 -0.7812044 0.29786724 -0.33945233 -0.15428163
-0.29845345 -0.4976692 0.4737923 0.64622486 0.07032842 0.6627486
-0.1217719 1.2479517 1.2771091 -0.27718675 -0.8850082 0.02593476
-0.3363942 -0.13683066 -0.21131852 0.71538883 0.20307784 0.49359712
0.17839476 0.8253619 0.02555602 -0.8356611 0.28051728 0.4898358
-0.23563443 0.26072916 0.17412627 -0.04020917 -0.38397732 0.66754127
-0.40798146 0.7376347 0.18230589 0.14425813 -0.05379809 0.25963986
-0.15015112 -0.11462995 0.15018293 0.2490777 0.02795055 -0.6758653
-0.5651579 -0.20499875 0.4818798 -0.08807535 0.02335486 0.35514528
0.2359741 -0.5734041 -0.09972285 0.42608282]
[ 1.1425358 1.085829 0.06618865 -0.17689444 1.9250381 -0.16798918
-0.06429418 -1.2078681 -2.2152655 -0.02595476 -0.66030055 -0.31331846
-1.7492689 0.38104573 1.0095595 1.1981317 0.62744445 0.70126086
-0.51447636 0.23039319 1.4618404 -0.4425715 0.90491027 1.690043
0.5215146 1.370187 -1.0005503 2.0233798 -0.02657754 2.104903
0.37081173 0.82154423 -1.1229697 1.1556014 -0.25204104 0.39861757
-0.21694905 0.04076057 0.81902725 -0.00676204 0.5215546 0.7708205
0.37481806 0.5205949 0.316242 0.22936285 0.43045118 -0.23635614
0.9651361 0.30661744 1.6191809 -0.3772338 -0.2320475 1.6246716
-0.5283455 -0.9977228 0.6600026 1.0514705 -1.1038932 0.8083892
-1.00779 -0.51661724 -1.2668477 -0.5466114 ]
[ 0.6319297 0.4785565 0.02505263 0.7901241 1.2366308 0.46019888
1.7913702 0.6726955 0.7138597 -0.3485523 -0.06709349 0.06224275
0.72310245 -1.4524001 -0.32194382 -0.45648706 0.5118707 0.45032433
-0.66458523 1.8966521 0.5645725 -0.89448386 -0.49603692 0.7946624
0.12307948 0.5955876 0.5504677 0.8886738 -0.07013337 1.21222
0.41646817 -0.6045066 -0.04891211 -0.5843263 -0.01375762 0.69063437
-0.966545 -0.26620457 0.87677234 0.85140395 0.5473012 0.17918314
0.64876425 1.0385733 -0.4950508 0.37709272 -0.9933432 -0.53037596
-0.17313723 -0.44029772 1.0435693 -0.9378267 0.21616688 1.2259327
-1.0350102 -0.74258107 -0.07339547 0.3359737 -0.7050308 -1.128884
0.2915307 0.6961444 0.6425292 0.09696565]]
For the datasets used for training and testing, it is generally recommended to train on both together, so that the best semantic coverage is obtained. In real-world engineering, however, the training corpora are usually very large, with text volumes reaching hundreds of gigabytes, so the problem of missing words does not arise and it is sufficient to train only on the training set.
Conclusion
This chapter took the AG News text data prepared in the previous chapter and vectorized the text with the gensim.models.word2vec model.