As deep neural networks become more adept at traditional tasks, many of the most exciting new challenges concern multimodality: observations that combine diverse types, such as image and text. In this paper, we introduce a family of multimodal deep generative models derived from variational bounds on the evidence (data marginal likelihood). As part of our derivation, we find that many previous multimodal variational autoencoders used objectives that do not correctly bound the joint marginal likelihood across modalities. We further generalize our objective to work with several types of deep generative model (VAE, GAN, and flow-based), and to allow different model types for different modalities. We benchmark our models across many image, label, and text datasets, and find that our multimodal VAEs excel both with and without weak supervision. Additional improvements come from pairing GAN image models with VAE language models. Finally, we investigate the effect of language on learned image representations through a variety of downstream tasks, such as compositionality, bounding box prediction, and visual relation prediction. We find evidence that these image representations are more abstract and compositional than equivalent representations learned from visual data alone.
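The variational bound on the evidence referenced above can be illustrated by the standard joint evidence lower bound for a multimodal VAE; the sketch below assumes modalities $x_1, \ldots, x_N$ generated from a shared latent $z$ with approximate posterior $q_\phi$, and is a generic illustration rather than the exact objective derived in this paper:

\[
\log p_\theta(x_1, \ldots, x_N)
\;\geq\;
\mathbb{E}_{q_\phi(z \mid x_1, \ldots, x_N)}
\Big[ \textstyle\sum_{i=1}^{N} \log p_\theta(x_i \mid z) \Big]
\;-\;
\mathrm{KL}\big( q_\phi(z \mid x_1, \ldots, x_N) \,\big\|\, p(z) \big)
\]

Here the reconstruction term factorizes across modalities given $z$, while the KL term regularizes the shared posterior toward the prior; objectives that mishandle this joint posterior are the kind that fail to bound the joint marginal likelihood correctly.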