ImageNet Classification with Deep Convolutional Neural Networks: Paper Translation (Part 1)

ImageNet Classification with Deep Convolutional Neural Networks: Paper Translation (Part 2)

code

AlexNet implementation (PyTorch): https://github.com/Lornatang/pytorch/blob/master/official/net/alexnet.py

ImageNet Classification with Deep Convolutional Neural Networks


Paper: http://static.tongtianta.site/paper_pdf/2c26fb78-7abb-11e8-87f8-00163e08ba34.pdf

Abstract


We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.


1 Introduction


Current approaches to object recognition make essential use of machine learning methods. To improve their performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting. Until recently, datasets of labeled images were relatively small — on the order of tens of thousands of images (e.g., NORB [16], Caltech-101/256 [8, 9], and CIFAR-10/100 [12]). Simple recognition tasks can be solved quite well with datasets of this size, especially if they are augmented with label-preserving transformations. For example, the current best error rate on the MNIST digit-recognition task (<0.3%) approaches human performance [4]. But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Pinto et al. [21]), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe [23], which consists of hundreds of thousands of fully-segmented images, and ImageNet [6], which consists of over 15 million labeled high-resolution images in over 22,000 categories.


To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we don’t have. Convolutional neural networks (CNNs) constitute one such class of models [16, 11, 13, 18, 15, 22, 26]. Their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies). Thus, compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.


Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply in large scale to high-resolution images. Luckily, current GPUs, paired with a highly-optimized implementation of 2D convolution, are powerful enough to facilitate the training of interestingly-large CNNs, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.


The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions [2] and achieved by far the best results ever reported on these datasets. We wrote a highly-optimized GPU implementation of 2D convolution and all the other operations inherent in training convolutional neural networks, which we make available publicly¹. Our network contains a number of new and unusual features which improve its performance and reduce its training time, which are detailed in Section 3. The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting, which are described in Section 4. Our final network contains five convolutional and three fully-connected layers, and this depth seems to be important: we found that removing any convolutional layer (each of which contains no more than 1% of the model’s parameters) resulted in inferior performance.


In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate. Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.


2 The Dataset


ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.


ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we report our results on this version of the dataset as well, for which test set labels are unavailable. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.

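To make the two metrics concrete, here is a minimal PyTorch sketch of how top-1 and top-5 error can be computed from a batch of class scores. It is not code from the paper; the tensor shapes and the topk_error helper are assumptions for illustration.

    import torch

    def topk_error(logits, targets, k):
        # fraction of test cases whose correct label is not among the k labels
        # the model considers most probable
        topk = logits.topk(k, dim=1).indices                 # (batch, k) highest-scoring classes
        hit = (topk == targets.unsqueeze(1)).any(dim=1)
        return 1.0 - hit.float().mean().item()

    scores = torch.randn(8, 1000)                            # toy scores over 1000 classes
    labels = torch.randint(0, 1000, (8,))
    print(topk_error(scores, labels, k=1), topk_error(scores, labels, k=5))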

ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.
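A rough PyTorch/torchvision sketch of this preprocessing (rescale the shorter side to 256, crop the central 256 × 256 patch, subtract the per-pixel training-set mean) might look as follows. It is not the authors' pipeline, and the mean_image tensor is a placeholder that would in practice be estimated from the training set.

    import torch
    from PIL import Image
    from torchvision import transforms

    # Resize(256) rescales so the shorter side is 256 while keeping the aspect ratio;
    # CenterCrop(256) then takes the central 256x256 patch.
    resize_and_crop = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(256),
        transforms.ToTensor(),               # raw RGB values as a (3, 256, 256) tensor
    ])

    # placeholder for the per-pixel mean image estimated over the training set
    mean_image = torch.zeros(3, 256, 256)

    def preprocess(path):
        img = resize_and_crop(Image.open(path).convert("RGB"))
        return img - mean_image              # centered raw RGB values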

3 The Architecture


The architecture of our network is summarized in Figure 2. It contains eight learned layers — five convolutional and three fully-connected. Below, we describe some of the novel or unusual features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of their importance, with the most important first.


¹ http://code.google.com/p/cuda-convnet/


3.1 ReLU Nonlinearity


The standard way to model a neuron’s output f as a function of its input x is with f(x) = tanh(x) or f(x) = (1 + e^{-x})^{-1}. In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x). Following Nair and Hinton [20], we refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. This is demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network. This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.

We are not the first to consider alternatives to traditional neuron models in CNNs. For example, Jarrett et al. [11] claim that the nonlinearity f(x) = |tanh(x)| works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset. However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.

Figure 1: A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.
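The training-speed argument comes down to gradients: tanh saturates, so its gradient vanishes for large inputs, while max(0, x) does not. A tiny PyTorch check of this point (an illustration, not an experiment from the paper):

    import torch

    x = torch.linspace(-4.0, 4.0, steps=9, requires_grad=True)

    # Saturating nonlinearity: the gradient of tanh collapses toward 0 for large |x|.
    tanh_grad, = torch.autograd.grad(torch.tanh(x).sum(), x)
    # Non-saturating nonlinearity f(x) = max(0, x): the gradient is 1 wherever x > 0.
    relu_grad, = torch.autograd.grad(torch.relu(x).sum(), x)

    print(tanh_grad)  # roughly 0.0013 at x = +/-4, so gradient descent slows once tanh saturates
    print(relu_grad)  # exactly 0 or 1, independent of how large the positive inputs are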

3.2 Training on Multiple GPUs


A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory. The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. Choosing the pattern of connectivity is a problem for cross-validation, but this allows us to precisely tune the amount of communication until it is an acceptable fraction of the amount of computation.


The resultant architecture is somewhat similar to that of the “columnar” CNN employed by Cireşan et al. [5], except that our columns are not independent (see Figure 2). This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net².

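The connectivity trick described above, where kernels in some layers only read the kernel maps held on their own GPU, can be mimicked on a single device with grouped convolutions. The sketch below is an illustration of that restricted connectivity, not the authors' two-GPU implementation; the layer sizes are borrowed from Section 3.5.

    import torch
    from torch import nn

    # Layer-3 style connectivity: every kernel sees all kernel maps from the previous layer.
    conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1)
    # Layer-4 style connectivity: groups=2 splits the 384 input maps into two halves,
    # and each kernel only sees the half that would live on its own GPU.
    conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2)

    x = torch.randn(1, 256, 13, 13)
    print(conv4(conv3(x)).shape)      # torch.Size([1, 384, 13, 13])
    # Half the connections, and half the parameters of an ungrouped 384 -> 384 convolution:
    print(conv4.weight.shape)         # torch.Size([384, 192, 3, 3]) instead of [384, 384, 3, 3]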

² The one-GPU net actually has the same number of kernels as the two-GPU net in the final convolutional layer. This is because most of the net’s parameters are in the first fully-connected layer, which takes the last convolutional layer as input. So to make the two nets have approximately the same number of parameters, we did not halve the size of the final convolutional layer (nor the fully-connected layers which follow). Therefore this comparison is biased in favor of the one-GPU net, since it is bigger than “half the size” of the two-GPU net.


3.3 Local Response Normalization


ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization. Denoting by a^i_{x,y} the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, the response-normalized activity b^i_{x,y} is given by the expression

$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big( a^{j}_{x,y} \big)^{2} \Big)^{\beta} $$

where the sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total number of kernels in the layer. The ordering of the kernel maps is of course arbitrary and determined before training begins. This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels. The constants k, n, α, and β are hyper-parameters whose values are determined using a validation set; we used k = 2, n = 5, α = 10^{-4}, and β = 0.75. We applied this normalization after applying the ReLU nonlinearity in certain layers (see Section 3.5).
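A direct, unoptimized PyTorch sketch of this expression with the constants above is shown below, for illustration only. PyTorch also ships a built-in torch.nn.LocalResponseNorm, but its convention for how alpha relates to the window size should be checked against the formula before treating the constants as interchangeable.

    import torch

    def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
        # a: (batch, N, H, W); normalize each activation by the sum of squares over
        # n "adjacent" kernel maps at the same spatial position, as in the expression above.
        N = a.size(1)
        b = torch.empty_like(a)
        for i in range(N):
            lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
            denom = (k + alpha * a[:, lo:hi + 1].pow(2).sum(dim=1)).pow(beta)
            b[:, i] = a[:, i] / denom
        return b

    a = torch.relu(torch.randn(2, 96, 55, 55))   # e.g. the ReLU output of the first conv layer
    print(local_response_norm(a).shape)          # torch.Size([2, 96, 55, 55])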

This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al. [11], but ours would be more correctly termed “brightness normalization”, since we do not subtract the mean activity. Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization³.


3.4 Overlapping Pooling


Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]). To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at the location of the pooling unit. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling. This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.
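The point about equivalent output dimensions is easy to verify: with stride s = 2, a 3 × 3 window (z = 3) and a 2 × 2 window (z = 2) produce feature maps of the same size, but the former's neighborhoods overlap. A small PyTorch check, assuming a 55 × 55 input map for illustration:

    import torch
    from torch import nn

    x = torch.randn(1, 96, 55, 55)

    non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)   # s = 2, z = 2: neighborhoods do not overlap
    overlapping = nn.MaxPool2d(kernel_size=3, stride=2)       # s = 2, z = 3: the scheme used in this network

    print(non_overlapping(x).shape)  # torch.Size([1, 96, 27, 27])
    print(overlapping(x).shape)      # torch.Size([1, 96, 27, 27]) -- equivalent output dimensions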

3.5 Overall Architecture


Now we are ready to describe the overall architecture of our CNN. As depicted in Figure 2, the net contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. Our network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.

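In framework terms, maximizing this multinomial logistic regression objective is the same as minimizing the average cross-entropy between the softmax output and the correct labels. A small PyTorch sketch of the equivalence, with arbitrary shapes chosen only for illustration:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 1000)             # unnormalized scores for 4 training cases over 1000 classes
    labels = torch.randint(0, 1000, (4,))

    # average log-probability of the correct label under the 1000-way softmax
    log_probs = F.log_softmax(logits, dim=1)
    avg_log_prob = log_probs[torch.arange(4), labels].mean()

    # maximizing it is exactly minimizing the standard cross-entropy loss
    print(torch.allclose(-avg_log_prob, F.cross_entropy(logits, labels)))   # True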

The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU (see Figure 2). The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fullyconnected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers, of the kind described in Section 3.4, follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.


The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully-connected layers have 4096 neurons each.

³ We cannot describe this network in detail due to space constraints, but it is specified precisely by the code and parameter files provided here: http://code.google.com/p/cuda-convnet/.


Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.
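Putting Sections 3.1-3.5 together, a compact single-device PyTorch sketch of the eight-layer network could look like the following. It is an approximation rather than the authors' implementation: the two-GPU split is imitated with groups=2, the padding values are assumptions chosen to reproduce the usual 55/27/13/13/13 feature-map sizes from a 224 × 224 × 3 input, and weight initialization and training details are omitted. A fuller implementation is linked at the top of this post.

    import torch
    from torch import nn

    class AlexNetSketch(nn.Module):
        def __init__(self, num_classes=1000):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
                # LRN constants as in Section 3.3 (library conventions for alpha may differ slightly)
                nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
                nn.MaxPool2d(kernel_size=3, stride=2),                      # overlapping pooling, Section 3.4
                nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2), nn.ReLU(inplace=True),
                nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
                nn.MaxPool2d(kernel_size=3, stride=2),
                nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2), nn.ReLU(inplace=True),
                nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2), nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2),
            )
            self.classifier = nn.Sequential(
                nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
                nn.Linear(4096, 4096), nn.ReLU(inplace=True),
                nn.Linear(4096, num_classes),                               # fed to a 1000-way softmax by the loss
            )

        def forward(self, x):
            x = self.features(x)
            return self.classifier(torch.flatten(x, 1))

    print(AlexNetSketch()(torch.randn(1, 3, 224, 224)).shape)               # torch.Size([1, 1000])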

Source: http://tongtianta.site/paper/1954
Edited by Lornatang
Proofread by Lornatang
