Improving Generation and Evaluation of Visual Stories via Semantic Consistency

Story visualization is an under-explored task that falls at the intersection of many important research directions in both computer vision and natural language processing. In this task, given a series of natural language captions which compose a story, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task. However, there is room for improvement of generated images in terms of visual quality, coherence and relevance. We present a number of improvements to prior modeling approaches, including (1) the addition of a dual learning framework that utilizes video captioning to reinforce the semantic alignment between the story and generated images, (2) a copy-transform mechanism for sequentially-consistent story visualization, and (3) MART-based transformers to model complex interactions between frames. We present ablation studies to demonstrate the effect of each of these techniques on the generative power of the model for both individual images and the entire narrative. Furthermore, due to the complexity and generative nature of the task, standard evaluation metrics do not accurately reflect performance. Therefore, we also provide an exploration of evaluation metrics for the model, focused on aspects of the generated frames such as the presence/quality of generated characters, the relevance to captions, and the diversity of the generated images. We also present correlation experiments of our proposed automated metrics with human evaluations. Code and data are available at: https://github.com/adymaharana/StoryViz

