Task: sentence-level classification with a CNN
Word vectors, wherein words are projected from a sparse, 1-of-V encoding (here V is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close (in Euclidean or cosine distance) in the lower dimensional vector space.
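As a quick illustration (my own sketch, not from the paper), projecting a sparse 1-of-V encoding through a hidden layer is just a row lookup in the projection matrix; all sizes below are made up:

```python
import numpy as np

V, k = 10, 4                      # vocabulary size, embedding dimension (illustrative)
W = np.random.randn(V, k)         # hidden-layer / embedding matrix

word_id = 3
one_hot = np.zeros(V)
one_hot[word_id] = 1.0            # sparse 1-of-V encoding of one word

dense = one_hot @ W               # projection onto the lower dimensional space
assert np.allclose(dense, W[word_id])   # identical to looking up one row
```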
CNNs use layers with convolving filters that are applied to local features. This paper trains a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model; the word vectors come from Mikolov et al. (2013).
The word vectors are first kept static while the other parameters of the model are learned. With little hyperparameter tuning, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are "universal" feature extractors that can be reused across classification tasks.
Learning task-specific vectors through fine-tuning results in further improvements.
We finally describe a simple modification to the architecture that allows the use of both pre-trained and task-specific vectors by having multiple channels.
Model:
Let x_i ∈ R^k be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as

x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n,

where ⊕ is the concatenation operator.
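A tiny sketch of this representation (dimensions chosen only for illustration):

```python
import numpy as np

n, k = 5, 4                                       # sentence length, vector dimension
words = [np.random.randn(k) for _ in range(n)]    # x_1, ..., x_n
x_1n = np.concatenate(words)                      # x_{1:n}, a vector in R^{nk}
```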
Convolution operation: a filter w ∈ R^{hk} is applied to a window of h words to produce a new feature. For example, a feature c_i is generated from a window of words x_{i:i+h-1} by

c_i = f(w · x_{i:i+h-1} + b),

where b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent. The filter is applied to each possible window of words in the sentence {x_{1:h}, x_{2:h+1}, ..., x_{n-h+1:n}} to produce a feature map c = [c_1, c_2, ..., c_{n-h+1}].
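A minimal sketch of one filter sliding over a sentence (sizes and values are placeholders, not the paper's settings):

```python
import numpy as np

n, k, h = 7, 4, 3                 # sentence length, vector dimension, window size
x = np.random.randn(n, k)         # word vectors x_1, ..., x_n, one per row
w = np.random.randn(h * k)        # filter w in R^{hk}
b = 0.1                           # bias term

# c_i = f(w · x_{i:i+h-1} + b) with f = tanh, applied to every window of h words
c = np.array([np.tanh(w @ x[i:i + h].reshape(-1) + b) for i in range(n - h + 1)])
print(c.shape)                    # feature map of length n - h + 1
```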
A max-over-time pooling operation is then applied over the feature map, and the maximum value ĉ = max{c} is taken as the feature corresponding to this filter. The idea is to capture the most important feature for each feature map, and this pooling scheme naturally handles variable sentence lengths.
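Continuing the convolution sketch above, max-over-time pooling keeps only the single largest activation per filter, so the output has a fixed size regardless of n:

```python
import numpy as np

c = np.tanh(np.random.randn(5))   # stand-in feature map c = [c_1, ..., c_{n-h+1}]
c_hat = c.max()                   # the one feature kept for this filter
```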
The model uses multiple filters (with varying window sizes) to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels. A rough single-channel sketch of the whole architecture follows below.
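The sketch is written in PyTorch for concreteness; the window sizes, filter counts, class count, and ReLU non-linearity are assumptions for illustration, not settings taken from these notes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, k=300, windows=(3, 4, 5),
                 n_filters=100, n_classes=2, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, k)
        # one Conv1d per window size h, each producing n_filters feature maps
        self.convs = nn.ModuleList([nn.Conv1d(k, n_filters, h) for h in windows])
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(n_filters * len(windows), n_classes)

    def forward(self, token_ids):                       # token_ids: (batch, n)
        x = self.embed(token_ids).transpose(1, 2)       # (batch, k, n)
        # convolve, apply the non-linearity, then max-over-time pool each map
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = torch.cat(pooled, dim=1)                     # penultimate layer
        return self.fc(self.dropout(z))                  # scores over labels
```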
The model in this paper has two channels of word vectors: one that is kept static throughout training and one that is fine-tuned via backpropagation.
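One possible way to set up the two channels (an implementation assumption, not code from the paper): two embedding tables initialized from the same pre-trained vectors, one frozen and one updated by backpropagation, with each filter applied to both channels and the responses added before computing c_i.

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)   # placeholder for the pre-trained word vectors
static_embed = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)   # static channel
tuned_embed = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)   # fine-tuned channel
```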
Regularization:
For regularization, dropout is employed on the penultimate layer with a constraint on the l2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping out, i.e. setting to zero, a proportion p of the hidden units during forward-backpropagation.
Given the penultimate layer z = [ĉ_1, ..., ĉ_m] (one feature per filter), dropout replaces y = w · z + b with y = w · (z ∘ r) + b, where ∘ is the element-wise multiplication operator and r ∈ R^m is a "masking" vector of Bernoulli random variables with probability p of being 1.
Gradients are backpropagated only through the unmasked units.
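A numerical sketch of this masking step and of the l2-norm constraint (layer size, keep probability, and the threshold s are placeholders, not values stated in these notes):

```python
import torch

p = 0.5
z = torch.randn(300)                      # penultimate layer z = [c_hat_1, ..., c_hat_m]
w = torch.randn(300, requires_grad=True)  # weight vector for one output unit
b = torch.tensor(0.0)

r = torch.bernoulli(torch.full_like(z, p))   # "masking" vector, 1 with probability p
y = w @ (z * r) + b                          # masked units contribute nothing to y
y.backward()
# w.grad is zero wherever r is zero: gradients flow only through unmasked units.

# l2-norm constraint: after a gradient step, rescale w whenever ||w||_2 exceeds s.
s = 3.0                                      # hypothetical threshold
with torch.no_grad():
    norm = w.norm()
    if norm > s:
        w *= s / norm
```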
Dropout proved to be such a good regularizer that it was fine to use a larger than necessary network and simply let dropout regularize it. Dropout consistently added 2%–4% relative performance.