Mini-batch Stochastic Gradient Descent

The implementation comes from Chapter 2 of Neural Networks and Deep Learning; see the book for details.

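We implement a neural network trained with the SGD learning algorithm, using the backpropagation (BP) algorithm to compute gradients.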

Network class definition

import random

import numpy as np

class Network():

Initialization method

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""

        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(sizes[:-1], sizes[1:])]

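Here we can see the parameters: the list sizes contains the number of neurons in each layer of the network. For example, if the list is [2, 3, 1], the network has three layers, with 2 neurons in the first layer, 3 in the second, and 1 in the last. The biases and weights are initialized randomly from a Gaussian distribution with mean 0 and variance 1. Note that the first layer is treated as the input layer and has no biases.
The Gaussian samples are drawn with the random module of the numpy library.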
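
To make the shapes concrete, here is a small sketch that repeats the two initialization lines for sizes = [2, 3, 1] outside the class. The variable names mirror the class attributes; the fixed seed is an addition made only so the run is repeatable.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only for repeatability

sizes = [2, 3, 1]
# One bias column vector per non-input layer.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One weight matrix per pair of adjacent layers, shaped (next, previous).
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

bias_shapes = [b.shape for b in biases]
weight_shapes = [w.shape for w in weights]
print(bias_shapes)    # [(3, 1), (1, 1)]
print(weight_shapes)  # [(3, 2), (1, 3)]
```

The (next, previous) layout of each weight matrix is what lets np.dot(w, a) map the activations of one layer to the inputs of the next.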

Defining the `feedforward` method:

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid_vec(np.dot(w, a)+b)
        return a

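This returns the output of the network when the vector a is the input. The result is a vector, and sigmoid_vec is applied to each element of its argument.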
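
As a quick illustration, the same loop can be run as a standalone function. This is a minimal sketch rather than the class method itself: the all-zero parameters are a hypothetical choice that makes the result easy to check by hand, and sigmoid is applied directly to the array.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, biases, weights):
    """Propagate the column vector ``a`` through every layer in turn."""
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return a

# Hypothetical [2, 3, 1] network with all parameters set to zero,
# chosen so the result is easy to verify by hand.
biases = [np.zeros((3, 1)), np.zeros((1, 1))]
weights = [np.zeros((3, 2)), np.zeros((1, 3))]
x = np.array([[0.5], [-1.0]])  # input as a (2, 1) column vector

output = feedforward(x, biases, weights)
print(output.shape)  # (1, 1)
print(output)        # [[0.5]], since sigmoid(0) = 0.5
```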

Defining the `SGD` method:

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""

        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            # Reshuffle every epoch so the mini-batches differ each time.
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {0} complete".format(j))

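This trains the neural network with mini-batch SGD. training_data is a list of training examples, each pairing a training input with its desired output. If test_data is given, the network is evaluated on the test set after every epoch and partial progress is printed. This is useful for tracking progress, but it slows things down.
In each epoch the training data is shuffled and then split into mini-batches.
Detail: the whole data set is sliced into chunks of size mini_batch_size.
Each mini-batch of the partition is then visited in turn, and update_mini_batch adjusts the parameters.
If test data is provided, the epoch number, the evaluation result, and the number of test cases are printed.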
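
The shuffle-and-slice step can be tried on its own. In this sketch, make_mini_batches is a hypothetical helper name (the original does this inline inside SGD), and the toy data stands in for real training examples.

```python
import random

def make_mini_batches(training_data, mini_batch_size):
    """Shuffle ``training_data`` in place and slice it into consecutive batches."""
    random.shuffle(training_data)
    n = len(training_data)
    return [training_data[k:k + mini_batch_size]
            for k in range(0, n, mini_batch_size)]

# Toy data: ten (x, y) pairs standing in for real training examples.
data = [(i, i * i) for i in range(10)]
batches = make_mini_batches(data, 4)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)  # [4, 4, 2]: the last batch holds the remainder
```

Note that when the data set size is not a multiple of mini_batch_size, the final batch is simply smaller; update_mini_batch divides by len(mini_batch), so the average is still taken over the actual batch.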

Defining `update_mini_batch`:

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

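This updates the weights and biases by applying gradient descent, with gradients computed by BP, to a single mini-batch. mini_batch is a list of tuples, [(x, y)], and eta is the learning rate.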
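
Written as equations, the update performed by the last two statements of update_mini_batch can be sketched as follows. The symbols are assumptions introduced for this note: m is the number of examples in the mini-batch, the sum runs over those examples, and C_x is the cost for a single example x.

```latex
\begin{aligned}
w^l &\rightarrow w^l - \frac{\eta}{m} \sum_{x} \frac{\partial C_x}{\partial w^l} \\
b^l &\rightarrow b^l - \frac{\eta}{m} \sum_{x} \frac{\partial C_x}{\partial b^l}
\end{aligned}
```

In the code, nabla_w and nabla_b hold the two sums, and eta/len(mini_batch) is the factor in front of them.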


Defining the `backprop` method:

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid_vec(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime_vec(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            spv = sigmoid_prime_vec(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * spv
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

Algorithm flow

BP equations

BP algorithm

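The method returns the tuple (nabla_b, nabla_w), the gradient of the cost function C_x. nabla_b and nabla_w are layer-by-layer lists of numpy arrays, laid out like self.biases and self.weights.
The feedforward pass comes first: activations stores the activations of every layer, and zs stores the z vector of every layer.
The loop walks through the layers, computing z and the activation for each and appending them to zs and activations.
The backward pass then starts by computing delta for the output layer, from the cost derivative and the derivative of the sigmoid function. nabla_b[-1] and nabla_w[-1] are computed separately, before the loop.
The loop then walks backward through the earlier layers, propagating delta.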
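
The quantities computed in the backward pass correspond to the equations below, read off from the code rather than quoted from the book. The notation is an assumption of this note: L is the last layer, \odot is the elementwise product, and the term a^L - y is the value returned by cost_derivative.

```latex
\begin{aligned}
\delta^L &= (a^L - y) \odot \sigma'(z^L) \\
\delta^l &= \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \\
\frac{\partial C_x}{\partial b^l} &= \delta^l \\
\frac{\partial C_x}{\partial w^l} &= \delta^l \, (a^{l-1})^T
\end{aligned}
```

The first line is the delta computed before the loop, the second is the update inside the loop, and the last two are the assignments to nabla_b and nabla_w.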

Test evaluation function

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y) 
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)
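
The counting logic of evaluate can be illustrated without a trained network. In this sketch the output activations and labels are made-up values, used only to show how np.argmax turns an output vector into a predicted class.

```python
import numpy as np

# Hypothetical final-layer activations for three test inputs
# (three output neurons each).
outputs = [
    np.array([[0.1], [0.8], [0.1]]),
    np.array([[0.7], [0.2], [0.1]]),
    np.array([[0.2], [0.3], [0.5]]),
]
labels = [1, 0, 1]  # correct class index for each input

# The predicted class is the index of the most active output neuron.
predictions = [int(np.argmax(out)) for out in outputs]
correct = sum(int(p == y) for p, y in zip(predictions, labels))
print(predictions)  # [1, 0, 2]
print(correct)      # 2
```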

Derivative of the cost function

    def cost_derivative(self, output_activations, y):
        r"""Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

Appendix

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

sigmoid_vec = np.vectorize(sigmoid)

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

sigmoid_prime_vec = np.vectorize(sigmoid_prime)

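Note the use of numpy's vectorize function here, which turns a scalar function into one that is applied to every element of an array.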
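
A small sketch of what np.vectorize does in this case. It also shows, as an observation that is not part of the original code, that sigmoid gives the same result when called on the array directly, because np.exp already works elementwise.

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function."""
    return 1.0 / (1.0 + np.exp(-z))

sigmoid_vec = np.vectorize(sigmoid)

z = np.array([[-2.0], [0.0], [2.0]])
via_vectorize = sigmoid_vec(z)  # calls sigmoid once per element
direct = sigmoid(z)             # np.exp is applied elementwise to the array

print(via_vectorize.shape)                 # (3, 1)
print(np.allclose(via_vectorize, direct))  # True
```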

Why BP is a fast algorithm

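To answer this question, first consider another way of computing the gradient. Imagine we are back in the neural network research of the 1950s or 1960s, and you are the first person in the world to think of using gradient descent for learning! To make the idea work, you need a way to compute the gradient of the cost function. Thinking back to the calculus you know, you decide to try the chain rule. But after playing around for a while, the algebra looks very complicated and you back off. So you look for another approach. You decide to regard the cost as a function of the weights alone, C = C(w). You number the weights w_1, w_2, ..., and want to compute the derivative of C with respect to some particular weight w_j. One way to approximate it is the following: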

\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}    (46)

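Here epsilon > 0 is a small positive number and e_j is the unit vector in the j-th direction. In other words, we can estimate dC/dw_j by computing the cost C at two slightly different values of w_j and then applying equation (46). The same approach can be used to compute dC/db.
This idea looks very promising. It is conceptually simple, easy to implement, and takes only a few lines of code. It certainly seems more promising than using the chain rule.
Unfortunately, once you implement it, this approach turns out to be very slow. To see why, suppose we have 1,000,000 weights. For each distinct weight w_j we need to compute C(w + \epsilon * e_j) in order to estimate dC/dw_j. So computing the gradient requires evaluating the cost function 1,000,000 times, which means 1,000,000 forward passes (per training example). We also need to compute C(w), for a total of 1,000,001 passes.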
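
The cost of this approach is easy to see in code. The sketch below applies the forward-difference estimate to a toy cost function and counts the evaluations; numerical_gradient and toy_cost are hypothetical names introduced for illustration, and in a real network each call to cost would be a full forward pass.

```python
import numpy as np

def numerical_gradient(cost, w, epsilon=1e-6):
    """Estimate dC/dw_j for every j with forward differences.

    Returns the gradient estimate and the number of cost evaluations used.
    """
    base = cost(w)  # C(w), computed once
    evaluations = 1
    grad = np.zeros_like(w)
    for j in range(w.size):
        e_j = np.zeros_like(w)
        e_j[j] = 1.0  # unit vector in the j-th direction
        grad[j] = (cost(w + epsilon * e_j) - base) / epsilon
        evaluations += 1
    return grad, evaluations

def toy_cost(w):
    """Toy cost C(w) = 0.5 * sum(w^2), whose exact gradient is w itself."""
    return 0.5 * np.sum(w ** 2)

w = np.array([1.0, -2.0, 3.0])
grad, evaluations = numerical_gradient(toy_cost, w)
print(evaluations)                      # 4, i.e. len(w) + 1
print(np.allclose(grad, w, atol=1e-4))  # True
```

The evaluation count grows linearly with the number of weights, which is exactly the 1,000,001 figure above for a network with a million weights.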
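The clever part of BP is that it lets us compute all the partial derivatives dC/dw_j simultaneously, using just one forward pass followed by one backward pass. Roughly speaking, the backward pass costs about the same as the forward pass.*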

*This should be plausible, but it requires some analysis to make a careful statement. It's plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it's multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.

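So the total cost of BP is roughly that of two forward passes. Compared with estimating each derivative directly, BP clearly has a large advantage. So even though BP looks more complicated than (46), it is in fact much faster.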
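
As a sanity check that one forward pass plus one backward pass really yields the same derivative as the finite-difference estimate, the two can be compared on a tiny network. This is a minimal standalone sketch, not the Network class: the functions take biases and weights as arguments, the quadratic cost 0.5 * ||a - y||^2 is assumed (its derivative matches cost_derivative), and the seed, sizes, and the weight being checked are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def cost(biases, weights, x, y):
    """Quadratic cost for one example: one full forward pass."""
    a = x
    for b, w in zip(biases, weights):
        a = sigmoid(np.dot(w, a) + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(biases, weights, x, y):
    """One forward pass and one backward pass, as in the method above."""
    activation, activations, zs = x, [x], []
    for b, w in zip(biases, weights):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    nabla_b = [None] * len(biases)
    nabla_w = [None] * len(weights)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].T)
    for l in range(2, len(weights) + 1):
        delta = np.dot(weights[-l + 1].T, delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l - 1].T)
    return nabla_b, nabla_w

np.random.seed(1)  # arbitrary seed, only for repeatability
sizes = [2, 3, 1]
biases = [np.random.randn(n, 1) for n in sizes[1:]]
weights = [np.random.randn(n, m) for m, n in zip(sizes[:-1], sizes[1:])]
x = np.random.randn(2, 1)
y = np.array([[1.0]])

# Analytic derivative for one weight, taken from the backprop result.
nabla_b, nabla_w = backprop(biases, weights, x, y)
analytic = nabla_w[0][1, 0]

# Finite-difference estimate for the same weight: one extra forward pass.
epsilon = 1e-6
shifted = [w.copy() for w in weights]
shifted[0][1, 0] += epsilon
numeric = (cost(biases, shifted, x, y) - cost(biases, weights, x, y)) / epsilon

print(abs(numeric - analytic) < 1e-4)
```

The backprop call produces every entry of nabla_w and nabla_b at once, while the finite-difference estimate needs a separate forward pass for each entry it checks.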

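This speedup was first widely appreciated in 1986, and it greatly expanded the range of problems that neural networks could handle. That in turn drew a large number of researchers into the field. Of course, BP is not a panacea. In the late 1980s people were already pushing against its limits, especially when trying to use BP to train deep neural networks. Later in the book we will see how modern computers and some clever new ideas have made it possible for BP to train such deep networks successfully.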

BP: the big picture

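As I have explained it, BP presents two mysteries. First, what is the algorithm really doing? We have built up a picture of the error being propagated backward from the output. But can we go deeper and form a stronger intuition for what all these matrix and vector multiplications are doing? The second mystery is how anyone could have discovered BP in the first place. Following the steps of an algorithm, or even understanding the proof that it works, is one thing. It does not mean you understand the problem well enough to have discovered the algorithm yourself. Is there a line of reasoning that could have led us to BP? This section explores both mysteries.
To improve our intuition about what the algorithm is really doing, suppose we have made a small change \Delta w_{jk}^l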

to be cont.