In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.




Learning reusable feature representations from large unlabeled datasets has been an area of active research. In the context of computer vision, one can leverage the practically unlimited amount of unlabeled images and videos to learn good intermediate representations, which can then be used on a variety of supervised learning tasks such as image classification. We propose that one way to build good image representations is by training Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and later reusing parts of the generator and discriminator networks as feature extractors for supervised tasks. GANs provide an attractive alternative to maximum likelihood techniques. One can additionally argue that their learning process and the lack of a heuristic cost function (such as pixel-wise independent mean-square error) are attractive to representation learning. GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs. There has been very limited published research in trying to understand and visualize what GANs learn, and the intermediate representations of multi-layer GANs.


In this paper, we make the following contributions:


• We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN).


• We use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.


• We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.


• We show that the generators have interesting vector arithmetic properties allowing for easy manipulation of many semantic qualities of generated samples.






Unsupervised representation learning is a fairly well studied problem in general computer vision research, as well as in the context of images. A classic approach to unsupervised representation learning is to do clustering on the data (for example using K-means), and leverage the clusters for improved classification scores. In the context of images, one can do hierarchical clustering of image patches (Coates & Ng, 2012) to learn powerful image representations. Another popular method is to train auto-encoders (convolutionally, stacked (Vincent et al., 2010), separating the what and where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015)) that encode an image into a compact code, and decode the code to reconstruct the image as accurately as possible. These methods have also been shown to learn good feature representations from image pixels. Deep belief networks (Lee et al., 2009) have also been shown to work well in learning hierarchical representations.




Generative image models are well studied and fall into two categories: parametric and non-parametric.


The non-parametric models often do matching from a database of existing images, often matching patches of images, and have been used in texture synthesis (Efros et al., 1999), super-resolution (Freeman et al., 2002) and in-painting (Hays & Efros, 2007).


Parametric models for generating images have been explored extensively (for example on MNIST digits or for texture synthesis (Portilla & Simoncelli, 2000)). However, generating natural images of the real world has not had much success until recently. A variational sampling approach to generating images (Kingma & Welling, 2013) has had some success, but the samples often suffer from being blurry. Another approach generates images using an iterative forward diffusion process (Sohl-Dickstein et al., 2015). Generative Adversarial Networks (Goodfellow et al., 2014) produced images that were noisy and incomprehensible. A Laplacian pyramid extension to this approach (Denton et al., 2015) showed higher quality images, but they still suffered from the objects looking wobbly because of noise introduced in chaining multiple models. A recurrent network approach (Gregor et al., 2015) and a deconvolution network approach (Dosovitskiy et al., 2014) have also recently had some success with generating natural images. However, they have not leveraged the generators for supervised tasks.




One constant criticism of using neural networks has been that they are black-box methods, with little understanding of what the networks do in the form of a simple human-consumable algorithm. In the context of CNNs, Zeiler & Fergus (2014) showed that by using deconvolutions and filtering the maximal activations, one can find the approximate purpose of each convolution filter in the network. Similarly, using gradient descent on the inputs lets us inspect the ideal image that activates certain subsets of filters (Mordvintsev et al.).




Historical attempts to scale up GANs using CNNs to model images have been unsuccessful. This motivated the authors of LAPGAN (Denton et al., 2015) to develop an alternative approach to iteratively upscale low resolution generated images which can be modeled more reliably. We also encountered difficulties attempting to scale GANs using CNN architectures commonly used in the supervised literature. However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models.


Core to our approach is adopting and modifying three recently demonstrated changes to CNN architectures.


The first is the all convolutional net (Springenberg et al., 2014), which replaces deterministic spatial pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn its own spatial downsampling. We use this approach in both our discriminator and our generator, where it allows the generator to learn its own spatial upsampling.
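As a rough illustration of how strided convolutions can take over the role of pooling, the sketch below computes spatial sizes with the standard size formulas for a strided convolution and for a fractionally-strided (transposed) convolution. The kernel size, stride, padding, and the starting sizes of 64 and 4 are illustrative assumptions, not settings stated in this section.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial size after a strided convolution (discriminator-style downsampling)."""
    return (size + 2 * padding - kernel) // stride + 1


def frac_strided_out_size(size, kernel, stride, padding):
    """Spatial size after a fractionally-strided convolution (generator-style upsampling)."""
    return (size - 1) * stride - 2 * padding + kernel


# Assumed hyperparameters, for illustration only: 4x4 kernels, stride 2, padding 1.
KERNEL, STRIDE, PADDING = 4, 2, 1

down = [64]  # assumed input resolution
for _ in range(4):
    down.append(conv_out_size(down[-1], KERNEL, STRIDE, PADDING))

up = [4]  # assumed starting spatial extent
for _ in range(4):
    up.append(frac_strided_out_size(up[-1], KERNEL, STRIDE, PADDING))

print(down)  # each strided convolution halves the spatial size
print(up)    # each fractionally-strided convolution doubles it
```

With these assumed settings the two stacks mirror each other, which is one way to see why no separate pooling or fixed upsampling layer is needed.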


Second is the trend towards eliminating fully connected layers on top of convolutional features. The strongest example of this is global average pooling which has been utilized in state of the art image classification models (Mordvintsev et al.). We found global average pooling increased model stability but hurt convergence speed. A middle ground of directly connecting the highest convolutional features to the input and output respectively of the generator and discriminator worked well. The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example model architecture.


Third is Batch Normalization (Ioffe & Szegedy, 2015) which stabilizes learning by normalizing the input to each unit to have zero mean and unit variance. This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models. This proved critical to get deep generators to begin learning, preventing the generator from collapsing all samples to a single point which is a common failure mode observed in GANs. Directly applying batchnorm to all layers however, resulted in sample oscillation and model instability. This was avoided by not applying batchnorm to the generator output layer and the discriminator input layer.
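The normalization step and the layer exceptions described above can be sketched as follows. This is a minimal illustration for a single unit: it omits the learned scale and shift parameters of full Batch Normalization, and the `eps` constant and the helper name `uses_batch_norm` are assumptions made for this example.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize one unit's inputs across a mini-batch to zero mean, unit variance."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


def uses_batch_norm(network, layer_index, num_layers):
    """Hypothetical helper: every layer is normalized except the generator
    output layer and the discriminator input layer."""
    if network == "generator":
        return layer_index != num_layers - 1
    if network == "discriminator":
        return layer_index != 0
    raise ValueError(f"unknown network: {network}")


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```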


The ReLU activation (Nair & Hinton, 2010) is used in the generator with the exception of the output layer which uses the Tanh function. We observed that using a bounded activation allowed the model to learn more quickly to saturate and cover the color space of the training distribution. Within the discriminator we found the leaky rectified activation (Maas et al., 2013) (Xu et al., 2015) to work well, especially for higher resolution modeling. This is in contrast to the original GAN paper, which used the maxout activation (Goodfellow et al., 2013).
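For reference, the three activation functions mentioned here can be written in a few lines. This is only a scalar sketch; the default leak slope of 0.2 is the value reported in the training details, and the function names are chosen for this example.

```python
import math


def relu(x):
    """Rectified linear unit, used in the generator's hidden layers."""
    return max(0.0, x)


def leaky_relu(x, slope=0.2):
    """Leaky rectified activation, used in the discriminator."""
    return x if x > 0 else slope * x


def tanh_output(x):
    """Bounded activation for the generator output, with range [-1, 1]."""
    return math.tanh(x)
```

Unlike ReLU, the leaky variant passes a scaled-down signal for negative inputs, and tanh keeps generated pixel values within a fixed range.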





We trained DCGANs on three datasets, Large-scale Scene Understanding (LSUN) (Yu et al., 2015), Imagenet-1k and a newly assembled Faces dataset. Details on the usage of each of these datasets are given below.


No pre-processing was applied to training images besides scaling to the range of the tanh activation function [-1, 1]. All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models. While previous GAN work has used momentum to accelerate training, we used the Adam optimizer (Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001 to be too high, using 0.0002 instead. Additionally, we found that leaving the momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability, while reducing it to 0.5 helped stabilize training.
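The training settings above can be collected into a small sketch: pixel scaling, weight initialization, and a single Adam update using the reported learning rate and momentum term (β1 in Adam's notation). The values of `BETA2` and `EPSILON` are not given in the text; the defaults suggested in the Adam paper are assumed here, as is the 8-bit pixel range.

```python
import math
import random

LEARNING_RATE = 0.0002  # from the text
BETA1 = 0.5             # momentum term, from the text
BETA2 = 0.999           # assumption: Adam's suggested default, not stated in the text
EPSILON = 1e-8          # assumption: Adam's suggested default, not stated in the text


def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the tanh range [-1, 1]."""
    return value / 127.5 - 1.0


def init_weights(n, rng):
    """Draw n weights from a zero-centered Normal with standard deviation 0.02."""
    return [rng.gauss(0.0, 0.02) for _ in range(n)]


def adam_step(params, grads, m, v, t):
    """One Adam update at step t (t starts at 1); returns new params, m, v."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = BETA1 * m_i + (1 - BETA1) * g        # first-moment estimate
        v_i = BETA2 * v_i + (1 - BETA2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - BETA1 ** t)             # bias correction
        v_hat = v_i / (1 - BETA2 ** t)
        new_params.append(p - LEARNING_RATE * m_hat / (math.sqrt(v_hat) + EPSILON))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


weights = init_weights(4, random.Random(0))
weights, m, v = adam_step(weights, [0.1, -0.2, 0.3, -0.4], [0.0] * 4, [0.0] * 4, t=1)
```

On the first step the bias-corrected update has magnitude close to the learning rate regardless of the gradient scale, so a smaller learning rate directly limits how far each weight can move.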




*Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps. A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.*



4.1 LSUN

As visual quality of samples from generative image models has improved, concerns of over-fitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing a little over 3 million training examples. Recent analysis has shown that there is a direct link between how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig. 2), mimicking online learning, in addition to samples after convergence (Fig. 3), as an opportunity to demonstrate that our model is not producing high quality samples via simply overfitting/memorizing training examples. No data augmentation was applied to the images.




To further decrease the likelihood of the generator memorizing input examples (Fig. 2) we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized ReLU autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer activations are then binarized via thresholding the ReLU activation, which has been shown to be an effective information preserving technique (Srivastava et al., 2014) and provides a convenient form of semantic-hashing, allowing for linear time de-duplication. Visual inspection of hash collisions showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.
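The hashing step can be sketched as below, assuming the autoencoder has already produced code-layer activations for each image. The threshold value and the policy of keeping the first image in each hash bucket are assumptions made for this illustration; the autoencoder itself is not shown.

```python
def binarize(code, threshold=0.0):
    """Threshold ReLU code-layer activations into a hashable tuple of bits."""
    return tuple(1 if a > threshold else 0 for a in code)


def deduplicate(codes, threshold=0.0):
    """Return indices of items to keep: the first item seen for each binary hash.

    Each item is hashed and looked up once, so the pass is linear in the
    number of items.
    """
    seen = set()
    kept = []
    for index, code in enumerate(codes):
        key = binarize(code, threshold)
        if key not in seen:
            seen.add(key)
            kept.append(index)
    return kept


# Toy 3-unit codes; the text describes a 128-unit code layer.
toy_codes = [[0.0, 1.2, 0.0], [0.0, 0.9, 0.0], [0.5, 0.0, 0.0]]
print(deduplicate(toy_codes))
```

Because near-duplicate images tend to activate the same code units, they collide on the same binary key and all but one are dropped, without any pairwise comparison.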




We scraped images containing human faces from random web image queries of people's names. The names were acquired from dbpedia, with the criterion that the people were born in the modern era. This dataset has 3M images from 10K people. We run an OpenCV face detector on these images, keeping the detections that are sufficiently high resolution, which gives us approximately 350,000 face boxes. We use these face boxes for training. No data augmentation was applied to the images.



Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate.



Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual under-fitting via repeated noise textures across multiple samples such as the base boards of some of the beds.




We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.




5 Empirical Validation of DCGANs Capabilities



One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.


On the CIFAR-10 dataset, a very strong baseline performance has been demonstrated from a well tuned single layer feature extraction pipeline utilizing K-means as a feature learning algorithm. When using a very large number of feature maps (4800) this technique achieves 80.6% accuracy. An unsupervised multi-layered extension of the base algorithm reaches 82.0% accuracy (Coates & Ng, 2011). To evaluate the quality of the representations learned by DCGANs for supervised tasks, we train on Imagenet-1k and then use the discriminator's convolutional features from all layers, maxpooling each layer's representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated to form a 28672 dimensional vector and a regularized linear L2-SVM classifier is trained on top of them. This achieves 82.8% accuracy, outperforming all K-means based approaches. Notably, the discriminator has many fewer feature maps (512 in the highest layer) compared to K-means based techniques, but does result in a larger total feature vector size due to the many layers of 4 × 4 spatial locations. The performance of DCGANs is still less than that of Exemplar CNNs (Dosovitskiy et al., 2015), a technique which trains normal discriminative CNNs in an unsupervised fashion to differentiate between specifically chosen, aggressively augmented, exemplar samples from the source dataset. Further improvements could be made by finetuning the discriminator's representations, but we leave this for future work. Additionally, since our DCGAN was never trained on CIFAR-10, this experiment also demonstrates the domain robustness of the learned features.
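The pooling-and-concatenation step can be sketched as follows, assuming each layer's feature maps are max-pooled to a 4 × 4 grid. The toy layer sizes are hypothetical and far smaller than a real discriminator; the helper names are chosen for this example, and the linear L2-SVM that would be trained on the resulting vector is not shown.

```python
def max_pool_to_grid(feature_map, grid=4):
    """Max-pool one 2-D feature map (a list of rows) down to grid x grid."""
    height, width = len(feature_map), len(feature_map[0])
    if height % grid or width % grid:
        raise ValueError("this toy version needs sizes divisible by the grid")
    cell_h, cell_w = height // grid, width // grid
    return [
        [
            max(
                feature_map[r][c]
                for r in range(i * cell_h, (i + 1) * cell_h)
                for c in range(j * cell_w, (j + 1) * cell_w)
            )
            for j in range(grid)
        ]
        for i in range(grid)
    ]


def extract_features(layers, grid=4):
    """Pool every feature map of every layer, then flatten and concatenate."""
    vector = []
    for layer in layers:              # layer: list of 2-D feature maps
        for feature_map in layer:
            pooled = max_pool_to_grid(feature_map, grid)
            vector.extend(value for row in pooled for value in row)
    return vector


def toy_map(size, offset):
    """Hypothetical feature map with distinct, increasing values."""
    return [[offset + r * size + c for c in range(size)] for r in range(size)]


# Hypothetical toy network: 2 maps of 8x8 in one layer, 3 maps of 4x4 in the next.
toy_layers = [[toy_map(8, 0.0) for _ in range(2)], [toy_map(4, 1.0) for _ in range(3)]]
features = extract_features(toy_layers)
print(len(features))  # (2 + 3) feature maps * 16 grid cells
```

By the same arithmetic, a 28672 dimensional vector built from 4 × 4 grids corresponds to 28672 / 16 = 1792 feature maps in total across the layers used.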




Table 1: CIFAR-10 classification results using our pre-trained model. Our DCGAN is not pretrained on CIFAR-10, but on Imagenet-1k, and the features are used to classify CIFAR-10 images.





On the StreetView House Numbers dataset (SVHN) (Netzer et al., 2011), we use the features of the discriminator of a DCGAN for supervised purposes when labeled data is scarce. Following similar dataset preparation rules as in the CIFAR-10 experiments, we split off a validation set of 10,000 examples from the non-extra set and use it for all hyperparameter and model selection. 1000 uniformly class distributed training examples are randomly selected and used to train a regularized linear L2-SVM classifier on top of the same feature extraction pipeline used for CIFAR-10. This achieves state of the art (for classification using 1000 labels) at 22.48% test error, improving upon another modification of CNNs designed to leverage unlabeled data (Zhao et al., 2015). Additionally, we validate that the CNN architecture used in DCGAN is not the key contributing factor of the model's performance by training a purely supervised CNN with the same architecture on the same data and optimizing this model via random search over 64 hyperparameter trials (Bergstra & Bengio, 2012). It achieves a significantly higher 28.87% validation error.
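Selecting 1000 uniformly class distributed examples can be sketched as drawing the same number of random examples from each class. With the ten SVHN digit classes this means 100 per class. The function name and the synthetic label list are assumptions for this illustration.

```python
import random
from collections import defaultdict


def sample_uniform_by_class(labels, per_class, rng):
    """Return indices with exactly per_class randomly chosen examples per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], per_class))
    return chosen


# Synthetic stand-in for a label array: 5000 examples over 10 digit classes.
labels = [i % 10 for i in range(5000)]
subset = sample_uniform_by_class(labels, per_class=100, rng=random.Random(0))
print(len(subset))
```

Sampling per class rather than from the whole pool guarantees that no digit is under-represented in the small labeled set.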




We investigate the trained generators and discriminators in a variety of ways. We do not do any kind of nearest neighbor search on the training set, because nearest neighbors in pixel or feature space are trivially fooled by small image transforms (Theis et al., 2015). We also do not use log-likelihood metrics to quantitatively assess the model, as log-likelihood is a poor metric (Theis et al., 2015).


Table 2: SVHN classification with 1000 labels





The first experiment we did was to understand the landscape of the latent space. Walking on the manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions) and about the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the image generations (such as objects being added and removed), we can reason that the model has learned relevant and interesting representations. The results are shown in Fig. 4.
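A walk through the latent space can be sketched as interpolation between sampled Z points. The text does not state the interpolation scheme, so straight-line interpolation is assumed here, as are the 100 dimensional Z and the [-1, 1] range of the uniform distribution. Each vector on the path would then be fed to the generator, which is not shown.

```python
import random


def interpolate(z_start, z_end, steps):
    """Linearly interpolate between two latent vectors, endpoints included."""
    path = []
    for k in range(steps):
        t = k / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


rng = random.Random(0)
z_a = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # assumed range and dimension
z_b = [rng.uniform(-1.0, 1.0) for _ in range(100)]
walk = interpolate(z_a, z_b, steps=10)
print(len(walk), len(walk[0]))
```

Smooth changes in the generated images along such a path, rather than abrupt jumps, are what the text interprets as evidence against memorization.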




Previous work has demonstrated that supervised training of CNNs on large image datasets results in very powerful learned features (Zeiler & Fergus, 2014). Additionally, supervised CNNs trained on scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting. Using guided backpropagation as proposed by Springenberg et al. (2014), we show in Fig. 5 that the features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows. For comparison, in the same figure, we give a baseline for randomly initialized features that are not activated on anything that is semantically relevant or interesting.






In addition to the representations learnt by a discriminator, there is the question of what representations the generator learns. The quality of samples suggest that the generator learns specific object representations for major scene components such as beds, windows, lamps, doors, and miscellaneous furniture. In order to explore the form that these representations take, we conducted an experiment to attempt to remove windows from the generator completely.


On 150 samples, 52 window bounding boxes were drawn manually. On the second highest convolution layer features, logistic regression was fit to predict whether a feature activation was on a window (or not), by using the criterion that activations inside the drawn bounding boxes are positives and random samples from the same images are negatives. Using this simple model, all feature maps with weights greater than zero (200 in total) were dropped from all spatial locations. Then, random new samples were generated with and without the feature map removal.
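The selection and removal steps can be sketched as below, assuming the logistic regression has already been fit and yields one weight per feature map. Implementing "dropped" as setting the map to zero at every spatial location is an assumption of this illustration, as are the function names and toy values.

```python
def window_feature_maps(weights):
    """Indices of feature maps whose logistic-regression weight is positive."""
    return [i for i, w in enumerate(weights) if w > 0]


def drop_feature_maps(activations, dropped):
    """Zero the selected feature maps at every spatial location."""
    dropped = set(dropped)
    return [
        [[0.0 for _ in row] for row in fmap] if i in dropped else fmap
        for i, fmap in enumerate(activations)
    ]


# Toy example: 4 feature maps of size 2x2 and hypothetical regression weights.
toy_weights = [0.7, -0.3, 0.0, 1.2]
toy_activations = [[[1.0, 1.0], [1.0, 1.0]] for _ in toy_weights]
selected = window_feature_maps(toy_weights)
edited = drop_feature_maps(toy_activations, selected)
print(selected)
```

The edited activations would then be passed through the remaining generator layers to produce the samples without the selected features.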


The generated images with and without the window dropout are shown in Fig. 6, and interestingly, the network mostly forgets to draw windows in the bedrooms, replacing them with other objects.



Figure 4: Top rows: Interpolation between a series of 9 random points in Z shows that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In the 6th row, you see a room without a window slowly transforming into a room with a giant window. In the 10th row, you see what appears to be a TV slowly being transformed into a window.




In the context of evaluating learned representations of words, Mikolov et al. (2013) demonstrated that simple arithmetic operations revealed rich linear structure in representation space. One canonical example demonstrated that vector("King") - vector("Man") + vector("Woman") resulted in a vector whose nearest neighbor was the vector for Queen. We investigated whether similar structure emerges in the Z representation of our generators. We performed similar arithmetic on the Z vectors of sets of exemplar samples for visual concepts. Experiments working on only single samples per concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable generations that semantically obeyed the arithmetic. In addition to the object manipulation shown in Fig. 7, we demonstrate that face pose is also modeled linearly in Z space (Fig. 8).
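The averaging and arithmetic on Z vectors can be sketched as follows. The toy two dimensional vectors and the function names are assumptions for illustration; the uniform noise of scale ±0.25 and the 8 extra samples follow the description in the Figure 7 caption. Feeding the resulting vectors to the generator is not shown.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a set of Z vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_arithmetic(set_a, set_b, set_c):
    """Compute mean(A) - mean(B) + mean(C) on sets of exemplar Z vectors."""
    a, b, c = mean_vector(set_a), mean_vector(set_b), mean_vector(set_c)
    return [x - y + z for x, y, z in zip(a, b, c)]


def jitter(y, scale, count, rng):
    """Add uniform noise in [-scale, scale] to y to get nearby latent vectors."""
    return [[v + rng.uniform(-scale, scale) for v in y] for _ in range(count)]


# Toy 2-D vectors; the text averages three exemplars per concept.
y = concept_arithmetic([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0]], [[0.0, 1.0]])
samples = jitter(y, scale=0.25, count=8, rng=random.Random(0))
print(y)
```

Averaging several exemplars before the arithmetic reduces the influence of any single sample, which matches the observation that single-sample arithmetic was unstable.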


These demonstrations suggest interesting applications can be developed using Z representations learned by our models. It has been previously demonstrated that conditional generative models can learn to convincingly model object attributes like scale, rotation, and position (Dosovitskiy et al., 2014). This is to our knowledge the first demonstration of this occurring in purely unsupervised models. Further exploring and developing the above mentioned vector arithmetic could dramatically reduce the amount of data needed for conditional generative modeling of complex image distributions.



Figure 5: On the right, guided backpropagation visualizations of maximal axis-aligned responses for the first 6 learned convolutional features from the last convolution layer in the discriminator. Notice a significant minority of features respond to beds - the central object in the LSUN bedrooms dataset. On the left is a random filter baseline. Compared with the learned features, the baseline responses show little to no discrimination and only random structure.



Figure 6: Top row: un-modified samples from model. Bottom row: the same samples generated with dropping out "window" filters. Some windows are removed, others are transformed into objects with similar visual appearance such as doors and mirrors. Although visual quality decreased, overall scene composition stayed similar, suggesting the generator has done a good job disentangling scene representation from object representation. Extended experiments could be done to remove other objects from the image and modify the objects the generator draws.




We propose a more stable set of architectures for training generative adversarial networks and we give evidence that adversarial networks learn good representations of images for supervised learning and generative modeling. There are still some forms of model instability remaining - we noticed as models are trained longer they sometimes collapse a subset of filters to a single oscillating mode.


Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are averaged. Arithmetic was then performed on the mean vectors, creating a new vector Y. The center sample on the right hand side is produced by feeding Y as input to the generator. To demonstrate the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples) results in noisy overlap due to misalignment.


Further work is needed to tackle this form of instability. We think that extending this framework to other domains such as video (for frame prediction) and audio (pre-trained features for speech synthesis) should be very interesting. Further investigations into the properties of the learnt latent space would be interesting as well.



Figure 8: A "turn" vector was created from four averaged samples of faces looking left vs looking right. By adding interpolations along this axis to random samples we were able to reliably transform their pose.


