[Google Releases TFGAN] An Open-Source Lightweight Library for Generative Adversarial Networks

December 16, 2017 · GAN / Generative Adversarial Networks

By Joel Shor, Senior Software Engineer, Machine Perception
Translated from the Google Open Source Blog
Produced by QbitAI (量子位)

Training a neural network normally starts with defining a loss function, which tells the network how far its output is from the target. For an image-classification network, for example, the loss function assigns a high loss whenever the network produces a wrong label, say, tagging a dog as a cat.
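To make that concrete, here is a minimal sketch (not from the original post) of such a loss in TensorFlow 1.x; the logits and labels are dummy stand-ins for a real classifier and dataset:

```python
import tensorflow as tf

# Dummy stand-ins for a real classifier: a batch of 8 images, 10 classes (cat, dog, ...).
logits = tf.random_normal([8, 10])             # what the network predicts
labels = tf.constant([3] * 8, dtype=tf.int64)  # what it should have predicted

# Cross-entropy is high when the network puts its probability mass on the
# wrong class (e.g. labels a dog as a cat) and low when it is right.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

with tf.Session() as sess:
    print(sess.run(loss))
```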

Not every task has a loss function that is this easy to define, though, especially tasks that involve human perception, such as image compression or text-to-speech.

GANs (Generative Adversarial Networks) offer a solution to this problem, and have already driven major progress in applications such as generating images from text descriptions, super-resolution, and helping robots learn to grasp.

However, advances in theory and software engineering have not kept pace with the rapid progress of GANs themselves.


(Video: a generative model evolving over the course of training)

As the video above shows, the generative model starts out producing little more than noise, but eventually learns to generate fairly clear MNIST digits.

To make GANs easier to train and evaluate, we are open-sourcing TFGAN, a lightweight library for GANs. It includes easy-to-follow examples that show off TFGAN's expressiveness and flexibility, along with a tutorial demonstrating how the high-level API can quickly train a model on your own data.
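For reference, here is a minimal sketch of that high-level flow, adapted from the example in the original post. It uses the tf.contrib.gan module as shipped around TensorFlow 1.4; `provide_mnist_batch`, `unconditional_generator`, and `unconditional_discriminator` are placeholders for your own data pipeline and networks:

```python
import tensorflow as tf
tfgan = tf.contrib.gan

# Input pipeline: real images plus random noise for the generator.
# `provide_mnist_batch` is a placeholder for your own data-loading function.
images = provide_mnist_batch(batch_size=32)
noise = tf.random_normal([32, 64])

# Wire up the generator and discriminator (you define these two networks).
gan_model = tfgan.gan_model(
    generator_fn=unconditional_generator,          # your network
    discriminator_fn=unconditional_discriminator,  # your network
    real_data=images,
    generator_inputs=noise)

# Pick a GAN loss from the built-in collection.
gan_loss = tfgan.gan_loss(
    gan_model,
    generator_loss_fn=tfgan.losses.wasserstein_generator_loss,
    discriminator_loss_fn=tfgan.losses.wasserstein_discriminator_loss)

# Create the alternating train ops and run training.
train_ops = tfgan.gan_train_ops(
    gan_model, gan_loss,
    generator_optimizer=tf.train.AdamOptimizer(1e-3, 0.5),
    discriminator_optimizer=tf.train.AdamOptimizer(1e-4, 0.5))

tfgan.gan_train(
    train_ops,
    hooks=[tf.train.StopAtStepHook(num_steps=20000)],
    logdir='/tmp/tfgan_mnist')
```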

The effect of an adversarial loss on image compression.

The top row shows images from the ImageNet dataset. The middle row shows the result of compressing and decompressing them with an image-compression network trained on a traditional loss. The bottom row shows a network trained with both the traditional loss and a GAN loss: its images have sharper edges and richer detail, even though they still differ slightly from the originals.

When used with the end-to-end Tacotron TTS network for speech synthesis, a GAN can restore some of the realistic texture of the audio, as illustrated below.

The over-smoothed spectrograms produced by most text-to-speech (TTS) networks.

Adding this recovered texture to Tacotron TTS effectively reduces artifacts in the generated audio, so the resulting speech sounds more realistic and natural (see https://arxiv.org/abs/1703.10135 for details on Tacotron).

TFGAN supports the major experimental setups in two ways. It provides simple functions that cover most GAN use cases, so developers can build a model on their own data in just a few lines of code. It is also designed to be modular: the loss, evaluation, feature, and training functions are independent pieces that you can freely mix and match, as sketched below.
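As an illustration of that modularity, the `gan_model` from the earlier sketch can be trained with a different objective simply by swapping the loss functions passed to `tfgan.gan_loss`. The `gradient_penalty_weight` keyword shown here is an assumption based on the tf.contrib.gan API around TensorFlow 1.4 and may differ in other versions:

```python
# Same model, different training objective: least-squares GAN losses plus a
# WGAN-GP style gradient penalty, mixed and matched from tfgan.losses.
gan_loss = tfgan.gan_loss(
    gan_model,
    generator_loss_fn=tfgan.losses.least_squares_generator_loss,
    discriminator_loss_fn=tfgan.losses.least_squares_discriminator_loss,
    gradient_penalty_weight=1.0)  # assumption: keyword name per tf.contrib.gan 1.4
```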

TFGAN also coexists with other frameworks and with native TensorFlow code, and models built with it will automatically benefit from future improvements to the underlying infrastructure. In addition, it ships a large collection of pre-implemented loss functions and evaluation metrics (a sketch of the evaluation helpers follows below), so developers do not have to spend time rewriting them. Most importantly, the code has been thoroughly tested, so you do not need to worry about the numerical or statistical mistakes that are easy to make in a GAN library.
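A rough sketch of the evaluation side, assuming the metric helpers are exposed under `tfgan.eval` (the exact names, and the Inception-sized inputs they require, may vary by TensorFlow version):

```python
# Sketch only: monitor sample quality with standard GAN metrics.
# Both metrics run the images through the Inception network, so they are
# expected to be batches of 299x299x3 images in the preprocessed range.
generated_images = gan_model.generated_data

inception_score = tfgan.eval.inception_score(generated_images)
fid = tfgan.eval.frechet_inception_distance(images, generated_images)
```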

Finally, the TFGAN repository:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/gan

Original post:
https://opensource.googleblog.com/2017/12/tfgan-lightweight-library-for-generative-adversarial-networks.html

Further reading

☞  [The most detailed introduction to GANs] Wang Fei-Yue et al.: Research progress and prospects of generative adversarial networks (GANs)

☞  [Intelligent Automation Frontier Lecture Series, Session 1] Prof. Wang Fei-Yue: Research progress and prospects of generative adversarial networks (GANs)

☞  [Intelligent Automation Frontier Lecture Series, Session 1] Associate Researcher Wang Kunfeng: GANs and parallel vision

☞  [Highlight] Parallelism will become the norm: reflections on SimGAN winning the CVPR 2017 Best Paper Award

☞  [Parallel Forum] Parallel images: a new theoretical framework for image generation

☞  ["Father of reinforcement learning" Sutton] Predictive learning is about to take off, and AI will help us understand human consciousness

☞  [TFGAN] Google open-sources TFGAN to make training and evaluating GANs easier

☞  [Academia] Intel & Toyota jointly open-source the urban driving simulator CARLA

☞  [Academia] After image recognition, image captioning systems have also been compromised by adversarial examples!

☞  [NIPS 2017] Tsinghua University's AI innovation team wins the AI adversarial attack-and-defense competition

☞  [NVIDIA's NIPS paper] Using GANs to make a sunny day rain, turn kittens into lions, and turn night into day

☞  [BicycleGAN] NIPS 2017 paper diversifies image-to-image translation, greatly improving on pix2pix results


