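Preface
This tutorial follows the official TensorFlow tutorial at tensorflow.org. The site and most of the related learning resources are blocked in mainland China, so you will need to arrange your own way to reach them. I wrote this article to record my own progress in learning machine learning; if it helps you along the way, I would be honored. The translation is my own and my skills are limited, so please bear with me. The code is Python 3, run with the Anaconda distribution of Python, and the editor is PyCharm.
In this article we build a neural network that analyzes the text of movie reviews and classifies each review as negative or positive. This is a typical binary classification problem, an important and widely applicable kind of machine learning problem.
We will use the IMDB dataset from the Internet Movie Database. It contains the text of 50,000 movie reviews, split into 25,000 reviews for training and 25,000 for testing. Both the training set and the test set are balanced: each contains an equal number of negative and positive reviews.
This tutorial uses tf.keras, the high-level API in TensorFlow for building and training models. You can import it with the following Python code: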
import tensorflow as tf
from tensorflow import keras
import numpy as np
print(tf.__version__)
1.10.0
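Download the IMDB dataset
The IMDB dataset ships with TensorFlow. It has already been preprocessed so that each review is a sequence of integers, where each integer stands for a word's position in a dictionary.
The following code downloads the IMDB dataset; if you have already downloaded it, the code reads the cached copy instead: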
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 0s 0us/step
The argument num_words=10000 keeps the 10,000 most frequently occurring words. Rarer words are dropped from the vocabulary to keep the size of the data manageable.
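As a rough illustration of what capping the vocabulary means, the sketch below replaces every index at or above the limit with an out-of-vocabulary index. The helper name cap_vocabulary and the sample review are hypothetical; the choice of index 2 is an assumption that matches the <UNK> token defined later in this tutorial.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace every index at or above num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in sequence]


# A made-up review containing two indices outside the 10,000-word vocabulary.
review = [1, 14, 22, 12000, 43, 9999, 10000]
capped = cap_vocabulary(review, num_words=10000)
print(capped)  # [1, 14, 22, 2, 43, 9999, 2]
```

This is why decoded reviews contain <UNK> markers where rare words used to be.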
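Explore the data
Let's take a moment to understand the format of the data. The dataset comes preprocessed: each movie review is a long sequence of integers, each standing for a word's position in the dictionary. Each label is an integer, either 0 or 1, where 0 is a negative review and 1 is a positive review.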
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
This prints the size of the training set:
Training entries: 25000, labels: 25000
Each review has been converted to a long sequence of integers. Here is what the first training example looks like:
print(train_data[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Movie reviews generally differ in length, but the inputs to a neural network must all be the same length. We will resolve this later.
print(len(train_data[0]), len(train_data[1]))
218 189
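Convert the integers back to words
It is often useful to be able to turn the integers back into words. Here we build a reverse lookup that maps each integer to its word in the dictionary, along with a helper function that decodes a review: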
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()
# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2 # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    # Look up each index; fall back to '?' for indices missing from the table
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step
Now we can try decoding a review:
print(decode_review(train_data[0]))
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
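To make the index shift easier to follow, here is a minimal sketch with a three-word toy vocabulary. The toy words and their raw indices are invented for illustration; only the shift by 3 and the four reserved tokens come from the code above.

```python
# Hypothetical raw word index, standing in for imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "brilliant": 3}

# Shift every index by 3 to make room for the reserved tokens.
shifted = {word: idx + 3 for word, idx in toy_word_index.items()}
shifted.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so that integers point back to words.
reverse = {idx: word for word, idx in shifted.items()}


def decode(indices):
    """Join the words for each index, using '?' for unknown indices."""
    return " ".join(reverse.get(i, "?") for i in indices)


decoded = decode([1, 4, 5, 2, 6, 0, 0])
print(decoded)  # <START> the film <UNK> brilliant <PAD> <PAD>
```

Without the shift, the most frequent word (raw index 1) would collide with the <START> token.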
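Prepare the data
Before the review text can be fed into the neural network, it must be converted to tensors. There are a couple of ways to do this:
- One-hot-encode the arrays, turning them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except at indices 3 and 5, which are ones. We would then make the first layer of the network a fully connected (Dense) layer, which can handle floating-point vector data. This approach is memory intensive, though: it needs a matrix of size num_words * num_reviews.
- Alternatively, pad the arrays so that they all have the same length, then create an integer tensor of shape num_examples * max_length. We can use an embedding layer capable of handling this shape as the first layer of the network.
This article uses the second approach.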
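For comparison, the first option can be sketched in a few lines of plain Python. The helper multi_hot is a hypothetical name, and the memory estimate assumes 4-byte float32 entries, which the original text does not specify.

```python
def multi_hot(sequence, dimension):
    """Return a vector of zeros with ones at every index present in sequence."""
    vector = [0.0] * dimension
    for index in sequence:
        vector[index] = 1.0
    return vector


encoded = multi_hot([3, 5], dimension=10000)
print(len(encoded), encoded[3], encoded[5], sum(encoded))

# Rough memory cost of encoding the whole training set this way,
# assuming one float32 (4 bytes) per entry.
num_words, num_reviews = 10000, 25000
bytes_float32 = num_words * num_reviews * 4
print(bytes_float32 / 1e9, "GB")
```

Under that assumption the dense matrix takes about 1 GB, while the padded integer tensor used here has only 25,000 * 256 entries.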
Since the movie reviews must all be the same length, we use the pad_sequences function to standardize the lengths:
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
Let's look at the lengths now:
print(len(train_data[0]), len(train_data[1]))
256 256
Next, let's inspect the first review after padding:
print(train_data[0])
[ 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941
4 173 36 256 5 25 100 43 838 112 50 670 2 9
35 480 284 5 150 4 172 112 167 2 336 385 39 4
172 4536 1111 17 546 38 13 447 4 192 50 16 6 147
2025 19 14 22 4 1920 4613 469 4 22 71 87 12 16
43 530 38 76 15 13 1247 4 22 17 515 17 12 16
626 18 2 5 62 386 12 8 316 8 106 5 4 2223
5244 16 480 66 3785 33 4 130 12 16 38 619 5 25
124 51 36 135 48 25 1415 33 6 22 12 215 28 77
52 5 14 407 16 82 2 8 4 107 117 5952 15 256
4 2 7 3766 5 723 36 71 43 530 476 26 400 317
46 7 4 2 1029 13 104 88 4 381 15 297 98 32
2071 56 26 141 6 194 7486 18 4 226 22 21 134 476
26 480 5 144 30 5535 18 51 36 28 224 92 25 104
4 226 65 16 38 1334 88 12 16 283 5 16 4472 113
103 32 15 16 5345 19 178 32 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]
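The effect of padding can be sketched in plain Python. The helper pad_post below is hypothetical and is not the Keras implementation. It assumes maxlen is positive and that over-long sequences lose tokens from the front, which is my understanding of the default truncating='pre' behavior of pad_sequences.

```python
def pad_post(sequence, maxlen, value=0):
    """Keep the last maxlen items, then right-pad with value up to maxlen."""
    trimmed = sequence[-maxlen:]
    return trimmed + [value] * (maxlen - len(trimmed))


short = pad_post([1, 14, 22], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [1, 14, 22, 0, 0]
print(long)   # [3, 4, 5, 6, 7]
```

The first review has 218 tokens, so it gets 38 trailing zeros, which matches the output above.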
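Build the model
A neural network is built by stacking layers, which calls for two main architectural decisions:
- How many layers to use in the model
- How many hidden units to use in each layer
In this example, the input data is an array of word indices, and the label to predict is either 0 or 1. We can build the following model for this problem: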
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 16) 160000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16) 0
_________________________________________________________________
dense (Dense) (None, 16) 272
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
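The parameter counts in the summary can be checked by hand. The short calculation below uses only the layer sizes from the model definition; the variable names are mine.

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 16

# Embedding: one 16-dimensional vector per vocabulary entry, no bias.
embedding_params = vocab_size * embedding_dim

# Dense(16): a 16x16 weight matrix plus 16 biases.
dense_params = embedding_dim * hidden_units + hidden_units

# Dense(1): 16 weights plus 1 bias.
output_params = hidden_units * 1 + 1

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
# 160000 272 17 160289
```

Almost all of the parameters sit in the embedding table; the pooling layer has none.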
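The layers are stacked as follows:
1. The first layer is an Embedding layer. It takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains. They add a dimension to the output array, so the resulting dimensions are (batch, sequence, embedding). For more detail, see the Keras Embedding layer documentation on the official site.
2. Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This lets the model handle input of variable length in the simplest way possible.
3. The fixed-length output vector then passes through a fully connected (Dense) layer with 16 hidden units.
4. The last layer, the output layer, is a fully connected layer with a single output node. With the sigmoid activation function, its output is a float between 0 and 1 that represents a probability, or confidence level.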
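The pooling step in item 2 is just an average over the sequence axis. Here is a minimal sketch for a single example, with made-up 2-dimensional embedding vectors instead of the 16-dimensional ones used by the model. The function name is hypothetical.

```python
def global_average_pool(sequence_of_vectors):
    """Average a list of equal-length vectors element by element."""
    length = len(sequence_of_vectors)
    dim = len(sequence_of_vectors[0])
    return [sum(vec[d] for vec in sequence_of_vectors) / length
            for d in range(dim)]


# Three tokens, each embedded as a 2-dimensional vector (toy values).
embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(embedded)
print(pooled)  # [3.0, 4.0]
```

Whatever the number of tokens, the result has one value per embedding dimension, which is what gives the following Dense layer a fixed-size input.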
Configure the optimizer and loss function used in this tutorial:
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
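To give a feel for the binary_crossentropy loss, here is a plain-Python sketch of the standard formula, averaged over examples. It is an illustration, not the TensorFlow implementation; the clipping constant eps and the sample predictions are assumptions.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [0, 1, 1, 0]
untrained = binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5])
confident = binary_crossentropy(labels, [0.1, 0.9, 0.8, 0.2])
print(round(untrained, 4), round(confident, 4))
```

A model that always outputs 0.5 scores ln 2, roughly 0.693, which is close to the loss reported for the first training epoch further down.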
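Create a validation set
While training, we want to check the model's accuracy on data it has not seen before. We do this by setting apart 10,000 examples from the original training data as a validation set. (Why not use the test set? Because our goal is to develop and tune the model using only the training data, and then use the test data just once, at the end, to evaluate it.)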
x_val = train_data[:10000]
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
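The slicing above can be sanity-checked on a stand-in list of 25,000 sample ids. The variable names here are hypothetical and no real data is involved.

```python
# Stand-in for the 25,000 training examples: just their positions.
samples = list(range(25000))

val = samples[:10000]            # first 10,000 go to validation
partial_train = samples[10000:]  # remaining 15,000 stay for training

print(len(val), len(partial_train))  # 10000 15000

# The two slices share no example and together cover everything.
no_overlap = set(val).isdisjoint(partial_train)
print(no_overlap)  # True
```

Because the same slice boundaries are applied to train_data and train_labels, each review stays paired with its label.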
Train the model
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
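The code above trains the model on the 15,000 samples in partial_x_train and partial_y_train for 40 epochs, in mini-batches of 512 samples.
During training, the model's loss and accuracy on the 10,000 samples of the validation set are monitored as well.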
Train on 15000 samples, validate on 10000 samples
Epoch 1/40
15000/15000 [==============================] - 1s 43us/step - loss: 0.6951 - acc: 0.5043 - val_loss: 0.6929 - val_acc: 0.5117
Epoch 2/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.6912 - acc: 0.5311 - val_loss: 0.6903 - val_acc: 0.5281
Epoch 3/40
15000/15000 [==============================] - 0s 28us/step - loss: 0.6893 - acc: 0.5553 - val_loss: 0.6888 - val_acc: 0.5674
Epoch 4/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.6870 - acc: 0.5961 - val_loss: 0.6866 - val_acc: 0.5853
Epoch 5/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.6841 - acc: 0.6161 - val_loss: 0.6831 - val_acc: 0.6584
Epoch 6/40
15000/15000 [==============================] - 0s 29us/step - loss: 0.6802 - acc: 0.6869 - val_loss: 0.6789 - val_acc: 0.6999
Epoch 7/40
15000/15000 [==============================] - 0s 28us/step - loss: 0.6746 - acc: 0.7159 - val_loss: 0.6735 - val_acc: 0.7093
Epoch 8/40
15000/15000 [==============================] - 0s 28us/step - loss: 0.6670 - acc: 0.7367 - val_loss: 0.6654 - val_acc: 0.7395
Epoch 9/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.6569 - acc: 0.7586 - val_loss: 0.6546 - val_acc: 0.7523
Epoch 10/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.6436 - acc: 0.7728 - val_loss: 0.6408 - val_acc: 0.7585
Epoch 11/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.6274 - acc: 0.7625 - val_loss: 0.6245 - val_acc: 0.7662
Epoch 12/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.6074 - acc: 0.7823 - val_loss: 0.6051 - val_acc: 0.7710
Epoch 13/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.5840 - acc: 0.7901 - val_loss: 0.5840 - val_acc: 0.7785
Epoch 14/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.5583 - acc: 0.8007 - val_loss: 0.5589 - val_acc: 0.7902
Epoch 15/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.5305 - acc: 0.8105 - val_loss: 0.5331 - val_acc: 0.7982
Epoch 16/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.5029 - acc: 0.8191 - val_loss: 0.5087 - val_acc: 0.8046
Epoch 17/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.4750 - acc: 0.8329 - val_loss: 0.4848 - val_acc: 0.8184
Epoch 18/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.4487 - acc: 0.8433 - val_loss: 0.4618 - val_acc: 0.8260
Epoch 19/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.4241 - acc: 0.8540 - val_loss: 0.4409 - val_acc: 0.8339
Epoch 20/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.4015 - acc: 0.8639 - val_loss: 0.4221 - val_acc: 0.8411
Epoch 21/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.3806 - acc: 0.8711 - val_loss: 0.4051 - val_acc: 0.8465
Epoch 22/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.3619 - acc: 0.8765 - val_loss: 0.3903 - val_acc: 0.8513
Epoch 23/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.3454 - acc: 0.8809 - val_loss: 0.3776 - val_acc: 0.8564
Epoch 24/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.3302 - acc: 0.8859 - val_loss: 0.3663 - val_acc: 0.8595
Epoch 25/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.3169 - acc: 0.8899 - val_loss: 0.3566 - val_acc: 0.8622
Epoch 26/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.3048 - acc: 0.8931 - val_loss: 0.3481 - val_acc: 0.8650
Epoch 27/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2941 - acc: 0.8965 - val_loss: 0.3407 - val_acc: 0.8680
Epoch 28/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2839 - acc: 0.8991 - val_loss: 0.3341 - val_acc: 0.8701
Epoch 29/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.2748 - acc: 0.9022 - val_loss: 0.3286 - val_acc: 0.8719
Epoch 30/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2669 - acc: 0.9043 - val_loss: 0.3235 - val_acc: 0.8720
Epoch 31/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.2585 - acc: 0.9082 - val_loss: 0.3192 - val_acc: 0.8753
Epoch 32/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2518 - acc: 0.9101 - val_loss: 0.3154 - val_acc: 0.8755
Epoch 33/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.2443 - acc: 0.9119 - val_loss: 0.3121 - val_acc: 0.8754
Epoch 34/40
15000/15000 [==============================] - 0s 26us/step - loss: 0.2378 - acc: 0.9154 - val_loss: 0.3089 - val_acc: 0.8757
Epoch 35/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2320 - acc: 0.9161 - val_loss: 0.3060 - val_acc: 0.8769
Epoch 36/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2257 - acc: 0.9195 - val_loss: 0.3038 - val_acc: 0.8774
Epoch 37/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2203 - acc: 0.9214 - val_loss: 0.3019 - val_acc: 0.8778
Epoch 38/40
15000/15000 [==============================] - 0s 28us/step - loss: 0.2150 - acc: 0.9232 - val_loss: 0.2993 - val_acc: 0.8786
Epoch 39/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2096 - acc: 0.9257 - val_loss: 0.2977 - val_acc: 0.8792
Epoch 40/40
15000/15000 [==============================] - 0s 27us/step - loss: 0.2047 - acc: 0.9275 - val_loss: 0.2959 - val_acc: 0.8803
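The numbers in the fit call translate into a fixed count of weight updates. The arithmetic below assumes the final, smaller batch of each epoch is kept rather than dropped, which is my understanding of how fit handles array inputs.

```python
import math

num_samples, batch_size, epochs = 15000, 512, 40

# Batches per epoch, counting the final partial batch.
steps_per_epoch = math.ceil(num_samples / batch_size)

# Size of that final partial batch.
last_batch = num_samples - (steps_per_epoch - 1) * batch_size

total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch, total_updates)  # 30 152 1200
```

So each line in the log above summarizes about 30 gradient steps.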
Evaluate the model
Let's see how the trained model performs. Two values are returned: the loss and the accuracy.
results = model.evaluate(test_data, test_labels)
print(results)
25000/25000 [==============================] - 0s 13us/step
[0.3104253210735321, 0.87236]
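The model reaches an accuracy of about 87% on the test set. Simply training for more epochs is unlikely to raise this by much, because the validation accuracy has already leveled off; according to the original tutorial, more advanced approaches should get closer to 95%.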
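The accuracy figure comes from turning the sigmoid outputs into hard 0/1 predictions. A minimal sketch, assuming a 0.5 decision threshold and using made-up probabilities:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)


# Four toy reviews: the third is positive but scored below the threshold.
acc = binary_accuracy([1, 0, 1, 1], [0.9, 0.2, 0.4, 0.7])
print(acc)  # 0.75
```

Accuracy only looks at which side of the threshold each output falls on, while the loss also reflects how confident each prediction was.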
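Plot a graph to check for overfitting
model.fit() returns a History object. Its history attribute is a dictionary that records what happened during training, and we can print its keys to see the structure: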
history_dict = history.history
print(history_dict.keys())
dict_keys(['loss', 'val_loss', 'acc', 'val_acc'])
We can use the following plotting code to see the overfitting more directly:
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf() # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
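In these plots, the dots show the training loss and accuracy, and the solid lines show the validation loss and accuracy.
Notice that the training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when using gradient descent optimization.
The validation accuracy and the training accuracy start to part ways after about twenty epochs. This is an example of overfitting: the model performs better on the training data than on data it has never seen before. Past this point the model over-optimizes, learning representations specific to the training data that do not generalize to the test data.
For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. In a later tutorial, you will see how to do this automatically with a callback.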
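The idea behind stopping early can be sketched without any framework: watch the validation loss and stop once it has failed to improve for a set number of epochs. The function name, the patience parameter, and the toy loss values below are all hypothetical; this is not the Keras callback itself.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both counted from 1."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    return len(val_losses), best_epoch


# Toy validation losses that bottom out at epoch 5 and then rise.
toy_val_losses = [0.69, 0.55, 0.42, 0.35, 0.33, 0.34, 0.36, 0.40]
stop_epoch, best_epoch = best_epoch_with_patience(toy_val_losses, patience=2)
print(stop_epoch, best_epoch)  # 7 5
```

A real training loop would also restore or keep the weights from the best epoch, which this sketch leaves out.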
# MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.