神经网络基础之 Optimizer 详解

参考引用：
Blog: An overview of gradient descent optimization algorithms
review paper: An overview of gradient descent optimization algorithms

在搭积木的时候往往面临要选择 optimizer，虽然说经验指导下选来选去都是那几个（玄学炼丹），但是作为一个有 wei 点 le 志 zhuang 向 bi 的科研菜鸟，还是想自己整理学习一下各个 optimizer 的特点，留下此篇。

Gradient Descent 梯度下降

Gradient descent is one of the most popular algorithms to perform optimization in Neural network. Nearly all of the state-of-the -art Deep Learning Library contains implementations of various algorithms to optimize gradient descent.

在台湾大学李宏毅教授的《机器学习》课程中，李宏毅教授把机器学习总结为三个 step：

step 1 ： define a set of function as model
step 2 : find a function evaluate the goodness of function
step3： pick the best function （optimization）

Gradient descent 是用于 minimize the object function (or say, the loss function) 的一种推广方法。我们知道在很多机器学习的课程中，介绍 gradient descent 都从多项式函数的优化开始讲的，也就是类似 $y = w_1x+w_2x_2+b$ 这样的函数求极值问题。学过一点数值计算的话我们就知道可以用最小二乘法来求解这种问题，但是最小二乘一般来说是针对于线性方程组的求解方法，所以为了推广到非线性的问题，使用 gradient descent 是不错的选择。

$\star$ gradient descent 算法详解：http://cs231n.github.io/optimization-1/

Gradient descent variants

在处理很多很多 samples，或者说在训练数据集 dataset 的时候，我们就要对最原始介绍的 gradient descent 做变动（其实就是把算法搬到数据集上应用）。

数据集的大小会影响运行 gradient descent 的速度，因为 gradient descent 的求导/求偏导运算是很花时间的，求一次导数更新一次参数值。如果一次性处理太多 sample 的话，那更新一次参数也未免太慢了；所以就有了设置 batch 的想法：一次就计算一批，由 batch 的 loss function 梯度来决定（近似）整个 dataset 的 loss function 梯度。

根据数据集的大小，我们需要 make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. 这里有三种 variants：

Batch gradient descent ( vanilla gradient descent )
Stocastic gradient descent
Mini-batch gradient descent
下面一个一个讲：

Batch gradient descent ( vanilla gradient descent )

Traning set 不管有多少 sample，全都拿去计算 gradient. 训练集在 500 以下都可以直接用。
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$
Batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online .

Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.

伪代码如下

for i in range( number_epoch ):
　　parameters_gradient = evaluate_gradient(loss_function, full_data, parameters)
　　parameters = parameters - learning_rate * parameters_gradient

缺点： Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update

Stochastic gradient descent (SGD)

Stochastic gradient descent performs a parameter update for each training example $x^{(i)}$ and laybel $y^{(i)}$ . 来一个 sample 算一次 gradient。

$\theta = \theta - \eta \cdot \nabla _\theta J (\theta; [x^{(i)},y^{(i)}])$

SGD does away with redundancy of recomputes gradient by performing one update at a time. It is therefore usually much faster and can also be used to learn online.

不过每来一个 example 就 update 一次参数未免也太频繁了，而且每次来的 example 其实个体差异性挺大的，这样导致最后 object function 波动很厉害：

Image 1: SGD fluctuation (Source: [Wikipedia])

SGD's fluctuation enables it to jump to new and potentially better local minima. On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.
　　不过，只要我们缓慢的减小 learning rate，SGD 就可以表现出和 Batch GD一样的性能 —— 陷入全局最优 or 局部最优。

伪代码如下：

for i in range(number_epochs):
　　np.random.shuffle(data)
　　for example in full_data:
　　　　parameters_gradient = evaluate_gradient(loss_function, example, parameters)
　　　　parameters = parameters - learning_rate * parameters_gradient

Mini-batch gradient descent

结合上述两种方法的利弊（其实上面两种方法对应两个极端情况），我们取一个 Batch (批次)，一批数据算一次 loss 然后算偏导进行参数更新。假设一个 batch 有 n 个 example，也就是 batch_size = n, 有：

$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; [x^{(i:i+n)},y^{(i:i+n)}])$

设置 batch 的好处在于：

reduces the variance of the parameter updates, which can lead to more stable convergence
can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient

常用的 batch 大小： 50、100、128、256.... 有的观点认为用 2 的幂次会更快。

伪代码如下：

for i in range(number_epochs):
　　np.random.shuffle(data)
　　for batch in get_batch(full_data, batch_size=128):
　　　　parameter_gradient = evaluate_gradient(loss_function, batch, parameter)
　　　　parameter = parameter - learning_rate * parameter_gradient

Challenge 困难与挑战

照这么看，似乎 Gradient Descent 已经很好的满足了我们的使用需求了。然而进一步实践可以发现即使是 mini_batch GD 也存在很多问题：我们只解决了由单次运算数据量引起的梯度下降准确性 accuracy 问题与运算时间（效率）的问题，还有其他困难和挑战：

How to choose a proper learning rate $\eta$ ?
If we use 'Learning Rate Schedules', i.e. adjusting (or say, reducing) the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds have to be defined in advance and are thus unable to adapt to a dataset's characteristics.
Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima & saddle points. For saddle points, they are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

面对以上这些问题与挑战，有许多改进的方法不断涌现：

Momentum 动量

讨论一个二维的优化问题：假设两个维度上的梯度分布是不同步的，比方说梯度曲线为椭圆形（这样的情况经常出现在 local minima 附近），如下图所示。图中最中间的小椭圆为梯度引导的最低处，可以把这个图想象成一个峡谷，SGD 的优化目标就是要优化到最中间的低处去。

在这种情况下，SGD 会在峡谷的两侧表面频繁的震荡，使得整个优化过程十分缓慢。

Momentum 就是帮助 SGD 在这种条件下进行加速的一种方法：在每次对 parameter vector $\theta$ 进行更新时，加上一个动量项，即上一次梯度更新的步长 $\times$ 很小的常量 $\gamma$ ：

Vanilla_SGD:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$
SGD_with_ Momentum :
$v_t = \gamma v_{t-1} + \eta\nabla_\theta J(\theta) \\ \\ \theta = \theta - v_t$

$\gamma$ 常常取 0.9、0.8 等等，而初速度 $v_0$ 其实自己定义就行，初速度为0就可以，初始梯度会赋予它更新的速度值 $v_1$

在使用 Momentum 的时候，由于动量的不断累积，在参数 update 过程中相同方向的 update 会越来越快，可以冲过一些 saddle point，使得算法更快收敛、震荡次数更少。

Nesterov accelerated gradient

重新想回 Momentum，我们设想假如一个 minima 位于两座“山”之间，由于 Momentum 的存在，小球总是要往两边滚来滚去往复很久才能最终停在中间位置。有没有一种方法能够让小球能够在小球准备冲上下一座“山”之前就减速呢？

Nesterov accelerated gradient (NAG) 就是这样一种方法。

回想一下我们在 Momentum 方法中用到的动量项 $\gamma v_{t-1}$ 代表了上一次梯度更新的影响， $\theta - \gamma v_{t-1}$ 表示参数 $\theta$ 减去动量项的影响后，下一次参数应该更新的 “大概位置”。这里说大概位置是因为真实的下一次参数应该更新的位置其实是 $\theta- \gamma v_{t-1} - \eta\nabla_\theta J(\theta)$ . 如果我们知道了下一次位置的近似值，根据下一次的位置求偏导求更新的步长来指导当下的运动，那么算法就有了 “预见性 prescience”，这样就可以让 $\theta$ 在参数准备到达 minima 的时候提前减速，而不用过多受 Momentum term 的影响 push 到更远的地方。

SGD with Momentum :
$v_t = \gamma v_{t-1} + \eta\nabla_\theta J(\theta) \\ \\ \theta = \theta - v_t$
Nesterov accelerated gradient :
$v_t = \gamma v_{t-1} + \eta\nabla_\theta J(\theta- \gamma v_{t-1}) \\ \\ \theta = \theta - v_t$

更多更具体的数学解释，可以参照：http://cs231n.github.io/neural-networks-3/

上面的方法已经可以让 optimizer 根据 Loss function surface 自适应的进行加速/减速了，但科学家还不满足，之前计算 gradient 都是计算 $\nabla _\theta$ ， $\theta$ 各个分量（维度）在统一的 learning rate 下进行更新。我们还希望根据每个单独的参数进行调整更新，以便根据它们的重要性执行较大或较小的 update。

Adagrad

Adagrad 的好处在于：It adapts the learning rate to the parameters.

所谓的 “Adapts 自适应” 就是说，在 frequently occurring features 时能够减小 learning rate $\eta$ ；在 infrequently occurring features 时能够增大 learning rate $\eta$ .什么意思呢？就是当优化过程中一些 feature 频繁出现的时候（例如说 loss function 的 error 不太怎么改变的时候），表示算法接近收敛了，这个时候就要减小 learning rate $\eta$ ，也就是说减少步长去逐步接近那个 minima.

For this reason, Adagrad is well-suited for dealing with sparse data.

RMSprop

最后编辑于：2019.04.15 11:10:37

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 204,445评论 6赞 478
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 85,889评论 2赞 381
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 151,047评论 0赞 337
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,760评论 1赞 276
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,745评论 5赞 367
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,638评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,011评论 3赞 398
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,669评论 0赞 258
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,923评论 1赞 299
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,655评论 2赞 321
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,740评论 1赞 330
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,406评论 4赞 320
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,995评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,961评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,197评论 1赞 260
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,023评论 2赞 350
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,483评论 2赞 342