Introduction to LSTM

Four main questions:

What is it?
Why does it exist?
What is it used for?
How does it work?

This article is based mainly on "Understanding LSTM Networks" from colah's blog. It combines translation with my own modest understanding.

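What is an LSTM?

The following definition is taken from Baidu Baike.

LSTM (Long Short-Term Memory) is a kind of recurrent neural network that is well suited to processing and predicting important events separated by relatively long intervals and delays in a time series.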

Why were LSTMs developed?

RNN

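Ordinary neural networks do not account for the persistent influence of data. In practice, data fed into the network earlier often affects data fed in later. RNNs were proposed to address this, that is, the inability of traditional neural networks to capture how previous events affect later ones. They do so by adding loops to the network. The figure below shows an RNN.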

RNN

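An RNN is essentially many copies of an ordinary neural network linked together, each copy passing information to the next, as shown below:

RNN chain structure

"LSTMs" are a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version.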

The Problem of Long-Term Dependencies[1]

RNNs can connect previous information to the present task. For example, previous video frames might inform the understanding of the present frame.

Can RNNs actually do this? It depends.

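Sometimes we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don't need any further context; it is pretty obvious that the next word is going to be “sky.” In such cases the RNN only needs the past n pieces of information, with n small: the gap between the relevant information and the place where it is needed is small.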

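But there are also cases where we need much more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” The recent information (“I speak fluent”) suggests that the next word is probably the name of a language, but to narrow down which language, we need the context of France from further back. Here the RNN needs the past n pieces of information with n large: the gap between the relevant information and the point where it is needed can become very large.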

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

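In theory, RNNs can handle such “long-term dependencies.” In practice, however, they do not seem able to learn them: once the amount of past information needed, n, becomes too large, standard RNNs are no longer suitable. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

LSTMs can handle the “long-term dependencies” problem.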

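What are LSTMs used for?

Applications

LSTM-based systems can learn tasks such as language translation, robot control, image analysis, document summarization, speech recognition, image recognition, handwriting recognition, chatbot control, prediction of diseases, click-through rates and stock prices, music synthesis, and more.

Current status

In 2015, Google significantly improved speech recognition on Android phones and other devices with LSTMs trained using CTC, a method published by Jürgen Schmidhuber's lab in 2006. Baidu also uses CTC. Apple uses LSTM in QuickType and Siri on the iPhone. Microsoft uses LSTM not only for speech recognition but also for generating virtual conversational avatars, writing program code, and more. Amazon Alexa talks to you at home through a bidirectional LSTM, while Google uses LSTM even more widely: it generates image captions, automatically replies to emails, is included in the new smart assistant Allo, and has significantly improved the quality of Google Translate (since 2016). A large share of the computing resources in Google's data centers is now devoted to running LSTM tasks.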

How do LSTMs work?

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997).

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

A standard RNN is therefore a chain of repeating neural network modules, and the module can be very simple, for example a single tanh layer, as shown below:

LSTM
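As a rough illustration of the "single tanh layer" module, here is a minimal NumPy sketch. The function names (`rnn_step`, `rnn_forward`), the weight shapes, and the random initialization are assumptions made for this example, not something taken from the original article.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One repeating module of a standard RNN: a single tanh layer."""
    concat = np.concatenate([h_prev, x_t])   # merge previous hidden state and input
    return np.tanh(W @ concat + b)

def rnn_forward(xs, h0, W, b):
    """Apply the same module at every time step, passing h along the chain."""
    h = h0
    hs = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, b)
        hs.append(h)
    return np.stack(hs)

# Illustrative sizes and random weights (assumptions for the demo).
rng = np.random.default_rng(0)
hidden, inputs, steps = 4, 3, 5
W = rng.normal(scale=0.5, size=(hidden, hidden + inputs))
b = np.zeros(hidden)
xs = rng.normal(size=(steps, inputs))

hs = rnn_forward(xs, np.zeros(hidden), W, b)
print(hs.shape)  # one hidden vector per time step
```

The same `W` and `b` are reused at every step, which is what "repeating module" means in the chain picture.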

The repeating module can also be much more complex, as in an LSTM, shown in the figure below:

[Figure unavailable: the repeating module in an LSTM]

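Next we will walk through each part of the LSTM diagram. Before doing so, we first need to understand what each shape and symbol in the diagram means.

Diagram notation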

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM can remove information from or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

forget gate layer

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
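To make the gating idea concrete, the sketch below applies a sigmoid output to a signal by pointwise multiplication. The vectors `signal` and `gate_logits` are made-up values chosen so that one entry is almost fully blocked, one is halved, and one passes almost unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

signal = np.array([2.0, -1.0, 0.5])
gate_logits = np.array([-20.0, 0.0, 20.0])  # hypothetical pre-activation values

gate = sigmoid(gate_logits)   # each entry lies between 0 and 1
gated = gate * signal         # pointwise multiplication

print(gate)    # close to 0, exactly 0.5, close to 1
print(gated)   # close to [0, -0.5, 0.5]
```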

An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

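The first step is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each value in the cell state C_{t-1}. An output of 1 means “keep this completely,” while an output of 0 means “discard this completely.” Take the example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The figure below shows the “forget gate layer”: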

[Figure unavailable: forget gate layer]
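The forget gate is commonly written as f_t = σ(W_f · [h_{t-1}, x_t] + b_f). The sketch below is one way to compute it in NumPy. The names `W_f` and `b_f`, the sizes, the random weights, and the bias of 1 are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_prev, x_t] + b_f): one value per cell-state entry."""
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Illustrative sizes and parameters (assumptions for the demo).
rng = np.random.default_rng(1)
hidden, inputs = 4, 3
W_f = rng.normal(size=(hidden, hidden + inputs))
b_f = np.ones(hidden)   # arbitrary choice that nudges the gate toward "keep"

f_t = forget_gate(np.zeros(hidden), rng.normal(size=inputs), W_f, b_f)
C_prev = np.array([1.0, -2.0, 0.5, 3.0])
kept = f_t * C_prev     # the part of the old cell state that survives
print(f_t)
print(kept)
```

Because every entry of `f_t` is between 0 and 1, `kept` can only shrink entries of the old cell state, never amplify them.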

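The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we will update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we will combine these two to update the cell state.
In the language model example, we want to add the gender of the new subject to the cell state, replacing the gender of the old subject that we are discarding.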

The figure below shows the “input gate layer” together with the tanh layer:

input gate layer + tanh layer
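The two parts of this step are usually written as i_t = σ(W_i · [h_{t-1}, x_t] + b_i) and C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C). Below is a minimal sketch under the same assumptions as before: hypothetical parameter names, random weights, small sizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidate(h_prev, x_t, W_i, b_i, W_C, b_C):
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ z + b_i)       # which entries to update, between 0 and 1
    C_tilde = np.tanh(W_C @ z + b_C)   # candidate values, between -1 and 1
    return i_t, C_tilde

# Illustrative sizes and parameters (assumptions for the demo).
rng = np.random.default_rng(2)
hidden, inputs = 4, 3
W_i = rng.normal(size=(hidden, hidden + inputs))
W_C = rng.normal(size=(hidden, hidden + inputs))
b_i = np.zeros(hidden)
b_C = np.zeros(hidden)

i_t, C_tilde = input_gate_and_candidate(
    np.zeros(hidden), rng.normal(size=inputs), W_i, b_i, W_C, b_C
)
print(i_t)
print(C_tilde)
```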

It is now time to update the old cell state C_{t-1} into the new cell state C_t. The previous steps already decided what to do; we just need to actually do it.

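We multiply the old state C_{t-1} by f_t, forgetting the things we decided to forget earlier. Then we add i_t * C̃_t. These are the new candidate values, scaled by how much we decided to update each state value.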

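In the case of the language model, this is where we actually drop the information about the old subject's gender and add the new information. The process is shown in the figure below:

Updating the cell state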
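The update rule described here, C_t = f_t * C_{t-1} + i_t * C̃_t, can be checked by hand with small numbers. The gate values below are hand-picked for illustration rather than produced by a trained network.

```python
import numpy as np

C_prev  = np.array([2.0, 2.0, 2.0])      # old cell state
C_tilde = np.array([-1.0, -1.0, -1.0])   # candidate values
f_t     = np.array([1.0, 0.0, 0.5])      # keep all / forget all / keep half
i_t     = np.array([0.0, 1.0, 0.5])      # write nothing / write all / write half

C_t = f_t * C_prev + i_t * C_tilde       # pointwise forget, then pointwise add
print(C_t.tolist())  # [2.0, -1.0, 0.5]
```

The first entry is carried over unchanged, the second is fully replaced by the candidate, and the third is a mix of both.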

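Finally, we need to decide what to output. This output will be based on the cell state, but will be a filtered version of it. First, we run a sigmoid layer that decides which parts of the cell state to output. Then we put the cell state through tanh (to push the values to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.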

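For the language model example, since it has just seen a subject, it might want to output information relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if a verb follows. In this step the sigmoid layer filters the cell state, and the output h_t is produced from that filtered version.

The process is shown in the figure below:

Model output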
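Putting the four stages together, one full LSTM step can be sketched as follows. The output gate is written as o_t = σ(W_o · [h_{t-1}, x_t] + b_o) and the output as h_t = o_t * tanh(C_t). The `params` dictionary, the helper `init_params`, and the initialization scale are assumptions of this example. It is a teaching sketch, not an optimized or trainable implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the four stages of the walk-through."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                     # cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)                               # filtered output
    return h_t, C_t

def init_params(hidden, inputs, rng):
    """Hypothetical helper: small random weights and zero biases."""
    params = {}
    for name in ("f", "i", "C", "o"):
        params["W_" + name] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
        params["b_" + name] = np.zeros(hidden)
    return params

rng = np.random.default_rng(0)
hidden, inputs, steps = 4, 3, 6
params = init_params(hidden, inputs, rng)

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(steps, inputs)):
    h, C = lstm_step(x_t, h, C, params)   # h and C are carried along the chain

print(h.shape, C.shape)
```

Note that only `h` is squashed by the output gate; `C` is passed to the next step as is, which matches the "conveyor belt" picture of the cell state.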

Variants on Long Short Term Memory

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.
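In equation form, one common way to write peephole connections is to add the cell state to the inputs of each gate. The weight shapes are assumed to be widened accordingly, and some formulations give peepholes only to a subset of the gates.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)
\end{aligned}
```

The forget and input gates look at the old cell state C_{t-1}, while the output gate looks at the updated state C_t.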

"peephole connections"

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.

coupled
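With coupled gates, the input gate is simply replaced by 1 - f_t, so the update becomes C_t = f_t * C_{t-1} + (1 - f_t) * C̃_t. The numbers below are hand-picked for illustration.

```python
import numpy as np

C_prev  = np.array([2.0, 2.0, 2.0])      # old cell state
C_tilde = np.array([-1.0, -1.0, -1.0])   # candidate values
f_t     = np.array([1.0, 0.0, 0.25])     # single gate controls both decisions

C_t = f_t * C_prev + (1.0 - f_t) * C_tilde   # forget exactly as much as we write
print(C_t.tolist())  # [2.0, -1.0, -0.25]
```

Each entry of the new state is a weighted average of the old value and the candidate, so there is one gate fewer to learn.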

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

GRU
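A GRU step can be sketched in the same style. In addition to the update gate z_t mentioned above, the usual formulation from Cho, et al. (2014) includes a reset gate r_t. Biases are omitted here for brevity, and the weight names, sizes, and random initialization are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; there is no separate cell state, only h."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in)                                      # update gate
    r_t = sigmoid(W_r @ z_in)                                      # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                    # blend old and new

# Illustrative sizes and random weights (assumptions for the demo).
rng = np.random.default_rng(3)
hidden, inputs, steps = 4, 3, 6
W_z = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
W_r = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
W_h = rng.normal(scale=0.1, size=(hidden, hidden + inputs))

h = np.zeros(hidden)
for x_t in rng.normal(size=(steps, inputs)):
    h = gru_step(x_t, h, W_z, W_r, W_h)

print(h.shape)
```

The last line of `gru_step` plays the role of the coupled forget and input gates: z_t decides how much of the old state to replace with the candidate.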

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Outlook

Future directions beyond LSTMs:

Attention: Xu, et al. (2015)
Grid LSTMs: Kalchbrenner, et al. (2015)
RNNs in generative models: Gregor, et al. (2015), Chung, et al. (2015), Bayer & Osendorfer (2015)

References

Understanding LSTM Networks-colah's blog