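Introduction
This article builds a model that converts an English sentence into its corresponding sequence of part-of-speech (POS) tags. The structure is shown in the figure below:
Preprocessing
Getting the data
The data comes from NLTK, a Python NLP package that ships with a sample of tagged sentences. We can write these sentences to text files and use them as our dataset.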
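import nltk
import numpy as np
# The treebank sample must be available locally, e.g. via nltk.download('treebank')
sents=nltk.corpus.treebank.tagged_sents()
fedata=open('treebank_sents.txt','w')
ffdata=open('treebank_poss.txt','w')
for sent in sents:
    words,poss=[],[]
    for word,pos in sent:
        # Skip tokens tagged -NONE-, which are not real words
        if (pos=='-NONE-'):
            continue
        words.append(word)
        poss.append(pos)
    # One sentence per line, with the matching tags on the same line of the second file
    fedata.write("{}\n".format(" ".join(words)))
    ffdata.write("{}\n".format(" ".join(poss)))
fedata.close()
ffdata.close()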
Let's take a look at the data:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
NNP NNP , CD NNS JJ , MD VB DT NN IN DT JJ NN NNP CD .
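As in the previous article, we need to decide the length of the input time series for the RNN. Because the input is at the word level, we also need an Embedding layer, so we must decide the vocabulary size as well. We also need to count how many training records there are. Finally, since this is a many-to-many model, the same quantities have to be determined for the output sequence Y.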
import collections
def parse_sentences(filename):
    word_freqs=collections.Counter()
    num_recs,max_len=0,0
    with open(filename) as f:
        for l in f:
            words=l.lower().strip().split()
            for w in words:
                word_freqs[w]+=1
            if(len(words)>max_len):
                max_len=len(words)
            num_recs+=1
    return word_freqs,max_len,num_recs
s_freq,s_maxlen,s_num=parse_sentences("treebank_sents.txt")
t_freq,t_maxlen,t_num=parse_sentences("treebank_poss.txt")
print("Words:",len(s_freq)," Max Seq Len:",s_maxlen," Records Num:",s_num)
print("Words:",len(t_freq)," Max Seq Len:",t_maxlen," Records Num:",t_num)
The statistics are as follows:
Words: 10947 Max Seq Len: 249 Records Num: 3914
Words: 45 Max Seq Len: 249 Records Num: 3914
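Before fixing a maximum sequence length, it can help to check how many sentences a candidate cutoff would actually cover. The following is a minimal sketch, not part of the original workflow: `length_coverage` is a hypothetical helper, and `toy_lengths` is made-up data standing in for the real per-sentence token counts.

```python
def length_coverage(lengths, max_len):
    """Return the fraction of sequences whose length is <= max_len."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Made-up lengths; with the real data these would be len(line.split())
# for every line of treebank_sents.txt.
toy_lengths = [18, 13, 25, 40, 249, 7, 31, 120]

for cutoff in (50, 100, 249):
    print(cutoff, length_coverage(toy_lengths, cutoff))
```

A cutoff that covers nearly all sentences while staying far below the longest one keeps the amount of padding per sentence down.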
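Preparing the training set
Based on these statistics, we set the input vocabulary size to 5000, the output vocabulary size to 45, and the maximum sentence length to 100, then build lookup tables that map words and tags to the integer form Keras can process.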
MAX_SEQLEN=100
S_FEATURES=5000
T_FEATURES=45
s_vocabsize=S_FEATURES+2
s_word2index={w[0]:i+2 for i,w in enumerate(s_freq.most_common(S_FEATURES))}
s_word2index['PAD']=0
s_word2index['UNK']=1
s_index2word={v:k for k,v in s_word2index.items()}
t_vocabsize=T_FEATURES+1
# The original book has an error here
t_word2index={w[0]:i+1 for i,w in enumerate(t_freq.most_common(T_FEATURES))}
t_word2index['PAD']=0
t_index2word={v:k for k,v in t_word2index.items()}
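To make the role of the two reserved IDs concrete, here is a small self-contained sketch of the same mapping scheme on toy data. The helpers `build_vocab` and `encode` and the toy sentence are assumptions made for illustration; they mirror the dictionary comprehension above but are not the article's own code.

```python
import collections

def build_vocab(freqs, max_words):
    """Map the most frequent words to IDs starting at 2; 0 and 1 are reserved."""
    word2index = {w: i + 2 for i, (w, _) in enumerate(freqs.most_common(max_words))}
    word2index['PAD'] = 0   # padding token
    word2index['UNK'] = 1   # any word outside the kept vocabulary
    return word2index

def encode(sentence, word2index):
    """Lowercase and split a sentence, replacing unknown words with the UNK ID."""
    return [word2index.get(w, word2index['UNK']) for w in sentence.lower().split()]

toy_freqs = collections.Counter("the the the board board will".split())
toy_vocab = build_vocab(toy_freqs, max_words=2)   # keeps only "the" and "board"
ids = encode("The board will resign", toy_vocab)
print(ids)
```

Because the input is lowercased before lookup, the uppercase keys 'PAD' and 'UNK' cannot collide with a real word.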
Next, build the dataset and check its shapes:
from keras.utils import to_categorical
from keras.preprocessing import sequence
def build_tensor(filename,num_recs,word2index,max_len,
                 make_categorical=False,num_classes=0):
    data=np.empty((num_recs,),dtype=list)
    fin=open(filename,'r')
    for i,line in enumerate(fin):
        wids=[]
        words=line.lower().strip().split()
        for w in words:
            if(w in word2index.keys()):
                wids.append(word2index[w])
            else:
                wids.append(word2index['UNK'])
        # When building Y, pad first and then one-hot encode the tag IDs
        if make_categorical:
            wids=np.array([wids])
            wids=sequence.pad_sequences(wids,maxlen=max_len)
            data[i]=np.array(to_categorical(wids,num_classes=num_classes))
        # When building X, keep the raw IDs; the Embedding layer handles them later
        else:
            data[i]=wids
    if(make_categorical):
        pdata=np.array([d.reshape((d.shape[1],d.shape[2])) for d in data])
    else:
        pdata=sequence.pad_sequences(data,maxlen=max_len)
    fin.close()
    return pdata
X=build_tensor('treebank_sents.txt',s_num,s_word2index,MAX_SEQLEN)
Y=build_tensor('treebank_poss.txt',t_num,t_word2index,MAX_SEQLEN,
               make_categorical=True,num_classes=t_vocabsize)
print(X.shape)
print(Y.shape)
(3914, 100)
(3914, 100, 46)
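The shapes above follow from two steps: padding every ID sequence to a fixed length, then one-hot encoding the tag IDs. The sketch below reproduces both steps in plain NumPy on toy data. `pad_pre` and `one_hot` are hypothetical stand-ins that imitate the default behaviour of Keras `pad_sequences` (zeros added at the front, over-long sequences cut from the front) and `to_categorical`; they are not the library functions themselves.

```python
import numpy as np

def pad_pre(seqs, max_len, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(seqs), max_len), value, dtype='int32')
    for i, s in enumerate(seqs):
        if len(s) == 0:
            continue
        trunc = s[-max_len:]            # keep the last max_len items
        out[i, -len(trunc):] = trunc    # real tokens end up on the right
    return out

def one_hot(ids, num_classes):
    """Turn an integer array of shape (...) into one-hot vectors of shape (..., num_classes)."""
    return np.eye(num_classes, dtype='float32')[ids]

toy_tags = [[3, 1, 2], [4, 4, 1, 2, 3, 1]]
padded = pad_pre(toy_tags, max_len=5)
Y_toy = one_hot(padded, num_classes=5)
print(padded)
print(Y_toy.shape)
```

Note that the padded positions are one-hot encoded as class 0 (PAD), so they count as ordinary targets during training unless they are masked out.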
Training
from sklearn.model_selection import train_test_split
Xtrain,Xtest,Ytrain,Ytest=train_test_split(X,Y,test_size=0.2,random_state=42)
First, let's try the Encoder-Decoder structure from the original book:
from keras import Sequential
from keras.layers import Embedding,SpatialDropout1D,GRU,LSTM,RepeatVector,Activation
from keras.layers import Dense,TimeDistributed
from keras.activations import softmax
from keras.optimizers import Adam
from keras.losses import categorical_crossentropy
EMBED_SIZE=128
HIDDEN_SIZE=128
BATCH_SIZE=32
model=Sequential()
model.add(Embedding(s_vocabsize,EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2))
# Encoder: compress the whole sentence into a single vector
model.add(GRU(HIDDEN_SIZE,dropout=0.2,recurrent_dropout=0.2))
# Repeat that vector once per output time step
model.add(RepeatVector(MAX_SEQLEN))
# Decoder: produce one hidden state per time step
model.add(GRU(HIDDEN_SIZE,return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',
              metrics=['accuracy'])
#model.summary()
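Note that because this is a many-to-many model, we use a TimeDistributed wrapper, which applies the fully connected layer to the RNN output at every time step. In other words, the final stage has many arrows rather than a single one.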
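What the TimeDistributed wrapper does can be illustrated without Keras: the same weight matrix is applied independently to the hidden state at each time step, followed by a softmax over the tag classes. The sketch below is a NumPy illustration with random toy weights; the function name and the shapes are chosen for the example, not taken from the model above.

```python
import numpy as np

def time_distributed_dense_softmax(h, W, b):
    """Apply one shared dense layer plus softmax at every time step.

    h: (batch, timesteps, hidden), W: (hidden, classes), b: (classes,)
    """
    logits = h @ W + b                              # (batch, timesteps, classes)
    logits -= logits.max(axis=-1, keepdims=True)    # for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 4, 8))    # 2 sentences, 4 time steps, 8 hidden units
W = rng.normal(size=(8, 3))       # 3 tag classes
b = np.zeros(3)

probs = time_distributed_dense_softmax(h, W, b)
print(probs.shape)
```

Each time step gets its own probability distribution over tags, but all of them share the single pair `W`, `b`.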
NUM_EPOCHS=1
model.fit(Xtrain,Ytrain,batch_size=BATCH_SIZE,epochs=NUM_EPOCHS,
          validation_data=(Xtest,Ytest))
score,acc=model.evaluate(Xtest,Ytest,batch_size=BATCH_SIZE)
print('Test score:%.3f,accuracy:%.3f'%(score,acc))
The result is as follows:
Train on 3131 samples, validate on 783 samples
Epoch 1/1
3131/3131 [==============================] - 26s 8ms/step - loss: 1.5883 - acc: 0.7533 - val_loss: 1.2532 - val_acc: 0.7549
783/783 [==============================] - 1s 2ms/step
Test score:1.253,accuracy:0.755
75%: doesn't look too bad, does it? Let's test on a few real sentences.
The expected output for the first few sentences looks like this:
NNP NNP , CD NNS JJ , MD VB DT NN IN DT JJ NN NNP CD .
NNP NNP VBZ NN IN NNP NNP , DT NNP VBG NN .
!head treebank_sents.txt>test_sent.txt
with open('test_sent.txt','r') as f:
    t_num=len(f.readlines())
my_test=build_tensor('test_sent.txt',t_num,s_word2index,MAX_SEQLEN)
r=model.predict(my_test)
for i in r:
    for w in i:
        print(t_index2word[np.argmax(w)],end=" ")
    print("\n")
However, what our network actually outputs is:
PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD
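Everything has become a padding token! The reason is that every sentence, regardless of its length, is padded to 100 tokens, so short sentences consist mostly of PAD, and a model that outputs nothing but PAD still scores a fairly high accuracy. This class-imbalance problem can be addressed by modifying the loss or by adding other samples. See https://nlpforhackers.io/lstm-pos-tagger-keras/, where the author changes the metric argument. A metric has no effect on training; it only reports performance during evaluation. Still, it lets us see how well the model is really doing, so let's add it and take a look.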
from keras import backend as K
def ignore_class_accuracy(to_ignore=0):
    def ignore_accuracy(y_true, y_pred):
        y_true_class = K.argmax(y_true, axis=-1)
        y_pred_class = K.argmax(y_pred, axis=-1)
        # Positions where the predicted class is to_ignore (PAD) are excluded
        ignore_mask = K.cast(K.not_equal(y_pred_class, to_ignore), 'int32')
        matches = K.cast(K.equal(y_true_class, y_pred_class), 'int32') * ignore_mask
        # Guard against division by zero when every position is ignored
        accuracy = K.sum(matches) / K.maximum(K.sum(ignore_mask), 1)
        return accuracy
    return ignore_accuracy
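To see why plain accuracy is misleading here, the following NumPy sketch compares it with a masked accuracy on toy labels. It is an illustration under assumptions: `plain_accuracy`, `masked_accuracy` and the toy arrays are made up for this example, and unlike the metric above, which masks on the predicted class, this version masks on the true class.

```python
import numpy as np

def plain_accuracy(y_true, y_pred):
    """Fraction of all positions (padding included) predicted correctly."""
    return float((y_true == y_pred).mean())

def masked_accuracy(y_true, y_pred, pad_id=0):
    """Accuracy computed only over positions whose true label is not padding."""
    mask = y_true != pad_id
    if mask.sum() == 0:
        return 0.0
    return float((y_true[mask] == y_pred[mask]).mean())

# Two left-padded toy tag sequences of length 10 (0 is PAD).
y_true = np.array([[0, 0, 0, 0, 0, 0, 0, 3, 1, 2],
                   [0, 0, 0, 0, 0, 0, 4, 1, 2, 3]])
all_pad = np.zeros_like(y_true)   # a "model" that always predicts PAD

print(plain_accuracy(y_true, all_pad))
print(masked_accuracy(y_true, all_pad))
```

On this toy data the always-PAD predictor gets 13 of 20 positions right under plain accuracy, yet none of the real tags.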
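However, the results are still poor...
This makes us suspect the model itself. We drop the Encoder-Decoder design and instead emit one output per input time step (many-to-many) directly. Note the commented-out lines below: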
model=Sequential()
model.add(Embedding(s_vocabsize,EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2))
#model.add(GRU(HIDDEN_SIZE,dropout=0.2,recurrent_dropout=0.2))
#model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE,return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',
              metrics=['accuracy'])
#model.summary()
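After fitting again, the results are indeed much better.
Summary
The padding length and strategy matter a lot for RNNs, and the loss and metrics need to be adjusted accordingly so that the network learns what we actually want. In addition, for POS tagging itself, the tag of a given word is largely fixed, so accuracy can be high as long as the network learns this mapping. Because dependencies across time steps are not that strong in this task and the sentences are long, the Encoder's output does not carry enough information, and the Encoder-Decoder structure performs poorly.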
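As one possible form of the loss adjustment mentioned above, the sketch below computes a cross-entropy that ignores padded positions. It is a NumPy illustration only, not something the article trained with: `masked_cross_entropy`, its arguments and the toy inputs are assumptions made for the example.

```python
import numpy as np

def masked_cross_entropy(y_true_ids, probs, pad_id=0, eps=1e-9):
    """Mean negative log-likelihood over non-padding positions only.

    y_true_ids: (batch, timesteps) integer tag IDs
    probs:      (batch, timesteps, classes) predicted distributions
    """
    mask = (y_true_ids != pad_id).astype('float64')
    # Probability assigned to the true tag at every position
    picked = np.take_along_axis(probs, y_true_ids[..., None], axis=-1)[..., 0]
    nll = -np.log(picked + eps)
    # Padded positions contribute nothing to the numerator or the denominator
    return float((nll * mask).sum() / max(mask.sum(), 1.0))

y_true_ids = np.array([[0, 0, 2, 1]])      # two PAD positions, two real tags
uniform = np.full((1, 4, 3), 1.0 / 3.0)    # a model that is equally unsure everywhere

print(masked_cross_entropy(y_true_ids, uniform))
```

With such a loss, predicting PAD everywhere no longer lowers the training objective, since padded positions are excluded from it.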