WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game to collect vision-and-language associations (e.g., werewolves to a full moon), used as a dynamic benchmark to evaluate state-of-the-art models. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player has to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis, as well as the feedback we collect from players, indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code, and the interactive game, aiming to allow future data collection that can be used to develop models with better association abilities.
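The scores above are reported as a Jaccard index, which compares the set of candidates a solver selects with the set the spymaster intended. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name `jaccard_index` and the example candidate labels are hypothetical, chosen for illustration.

```python
def jaccard_index(predicted, gold):
    """Return |predicted ∩ gold| / |predicted ∪ gold| for two collections.

    Assumption: each candidate image is identified by a hashable label.
    Two empty selections are treated as a perfect match (1.0) by convention.
    """
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    if not union:
        return 1.0
    return len(predicted & gold) / len(union)


# Hypothetical instance: the cue is associated with two of the candidates.
gold = ["full_moon", "wolf"]
guess = ["full_moon", "cat"]
score = jaccard_index(guess, gold)  # one shared item out of three distinct items
print(f"{score:.2f}")
```

A solver that picks exactly the intended candidates scores 1.0, and one that shares no candidates with the intended set scores 0.0, so partial overlap is rewarded proportionally.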

