Task: sentence-level classification with a CNN
Word vectors, wherein words are projected from a sparse, 1-of-V encoding (here V is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close (in Euclidean or cosine distance) in the lower dimensional vector space.
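As a quick illustration (my own sketch, not from the paper), projecting a sparse 1-of-V encoding through a hidden layer is just a row lookup in the projection matrix; all sizes below are made up:

```python
import numpy as np

V, k = 10, 4                      # vocabulary size, embedding dimension (illustrative)
W = np.random.randn(V, k)         # hidden-layer / embedding matrix

word_id = 3
one_hot = np.zeros(V)
one_hot[word_id] = 1.0            # sparse 1-of-V encoding of one word

dense = one_hot @ W               # projection onto the lower dimensional space
assert np.allclose(dense, W[word_id])   # identical to looking up one row
```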
CNNs use layers with convolving filters that are applied to local features. This paper trains a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model; the word vectors come from Mikolov et al. (2013).
The word vectors are first kept static while the other parameters of the model are learned. With little hyperparameter tuning, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are "universal" feature extractors that can be reused across classification tasks.
Learning task-specific vectors through fine-tuning results in further improvements.
We finally describe a simple modification to the architecture that allows the use of both pre-trained and task-specific vectors by having multiple channels.
Model:
Let x_i ∈ R^k be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as

x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n,

where ⊕ is the concatenation operator.
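A tiny sketch of this representation (dimensions chosen only for illustration):

```python
import numpy as np

n, k = 5, 4                                       # sentence length, vector dimension
words = [np.random.randn(k) for _ in range(n)]    # x_1, ..., x_n
x_1n = np.concatenate(words)                      # x_{1:n}, a vector in R^{nk}
```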
Convolution operation: a filter w ∈ R^{hk} is applied to a window of h words to produce a new feature. For example, a feature c_i is generated from a window of words x_{i:i+h-1} by

c_i = f(w · x_{i:i+h-1} + b),

where b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent. The filter is applied to each possible window of words in the sentence {x_{1:h}, x_{2:h+1}, ..., x_{n-h+1:n}} to produce a feature map c = [c_1, c_2, ..., c_{n-h+1}].
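A minimal sketch of one filter sliding over a sentence (sizes and values are placeholders, not the paper's settings):

```python
import numpy as np

n, k, h = 7, 4, 3                 # sentence length, vector dimension, window size
x = np.random.randn(n, k)         # word vectors x_1, ..., x_n, one per row
w = np.random.randn(h * k)        # filter w in R^{hk}
b = 0.1                           # bias term

# c_i = f(w · x_{i:i+h-1} + b) with f = tanh, applied to every window of h words
c = np.array([np.tanh(w @ x[i:i + h].reshape(-1) + b) for i in range(n - h + 1)])
print(c.shape)                    # feature map of length n - h + 1
```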
A max-over-time pooling operation is then applied over the feature map, and the maximum value ĉ = max{c} is taken as the feature corresponding to this filter. The idea is to capture the most important feature for each feature map, and this pooling scheme naturally handles variable sentence lengths.
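Continuing the convolution sketch above, max-over-time pooling keeps only the single largest activation per filter, so the output has a fixed size regardless of n:

```python
import numpy as np

c = np.tanh(np.random.randn(5))   # stand-in feature map c = [c_1, ..., c_{n-h+1}]
c_hat = c.max()                   # the one feature kept for this filter
```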
The model uses multiple filters (with varying window sizes) to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels. A rough single-channel sketch of the whole architecture follows below.
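The sketch is written in PyTorch for concreteness; the window sizes, filter counts, class count, and ReLU non-linearity are assumptions for illustration, not settings taken from these notes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, k=300, windows=(3, 4, 5),
                 n_filters=100, n_classes=2, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, k)
        # one Conv1d per window size h, each producing n_filters feature maps
        self.convs = nn.ModuleList([nn.Conv1d(k, n_filters, h) for h in windows])
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(n_filters * len(windows), n_classes)

    def forward(self, token_ids):                       # token_ids: (batch, n)
        x = self.embed(token_ids).transpose(1, 2)       # (batch, k, n)
        # convolve, apply the non-linearity, then max-over-time pool each map
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = torch.cat(pooled, dim=1)                     # penultimate layer
        return self.fc(self.dropout(z))                  # scores over labels
```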
The model in this paper has two channels of word vectors: one that is kept static throughout training and one that is fine-tuned via backpropagation.
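One possible way to set up the two channels (an implementation assumption, not code from the paper): two embedding tables initialized from the same pre-trained vectors, one frozen and one updated by backpropagation, with each filter applied to both channels and the responses added before computing c_i.

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 300)   # placeholder for the pre-trained word vectors
static_embed = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)   # static channel
tuned_embed = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)   # fine-tuned channel
```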
Regularization:
For regularization, dropout is employed on the penultimate layer with a constraint on the l2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping out, i.e. setting to zero, a proportion p of the hidden units during forward-backpropagation.
Given the penultimate layer z = [ĉ_1, ..., ĉ_m] (one feature per filter), dropout replaces y = w · z + b with y = w · (z ∘ r) + b, where ∘ is the element-wise multiplication operator and r ∈ R^m is a "masking" vector of Bernoulli random variables with probability p of being 1.
Gradients are backpropagated only through the unmasked units.
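A numerical sketch of this masking step and of the l2-norm constraint (layer size, keep probability, and the threshold s are placeholders, not values stated in these notes):

```python
import torch

p = 0.5
z = torch.randn(300)                      # penultimate layer z = [c_hat_1, ..., c_hat_m]
w = torch.randn(300, requires_grad=True)  # weight vector for one output unit
b = torch.tensor(0.0)

r = torch.bernoulli(torch.full_like(z, p))   # "masking" vector, 1 with probability p
y = w @ (z * r) + b                          # masked units contribute nothing to y
y.backward()
# w.grad is zero wherever r is zero: gradients flow only through unmasked units.

# l2-norm constraint: after a gradient step, rescale w whenever ||w||_2 exceeds s.
s = 3.0                                      # hypothetical threshold
with torch.no_grad():
    norm = w.norm()
    if norm > s:
        w *= s / norm
```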
Dropout proved to be such a good regularizer that it was fine to use a larger than necessary network and simply let dropout regularize it. Dropout consistently added 2%–4% relative performance.