MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering

While progress has been made on visual question answering leaderboards, models often exploit spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present \textit{MUTANT}, a training paradigm that exposes the model to perceptually similar, yet semantically distinct \textit{mutations} of the input, to improve OOD generalization on benchmarks such as the VQA-CP challenge. Under this paradigm, models use a consistency-constrained training objective to learn how semantic changes in the input (question-image pair) affect the output (answer). Unlike existing methods on VQA-CP, \textit{MUTANT} does not rely on knowledge of the train and test answer distributions. \textit{MUTANT} establishes a new state-of-the-art accuracy on VQA-CP with a $10.57\%$ improvement. Our work opens up avenues for the use of semantic input mutations for OOD generalization in question answering.
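
To make the idea of a consistency-constrained objective more concrete, the following is a minimal sketch in Python. It is not taken from the abstract: the function names `cosine_distance` and `consistency_penalty`, the use of cosine distance, and the toy three-answer vocabulary are all illustrative assumptions. The sketch only shows one plausible way to penalize a model whose predicted answer does not shift between an original input and its mutation by as much as the ground-truth answer shifts.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def consistency_penalty(pred_orig, pred_mutant, gold_orig, gold_mutant):
    """Hypothetical penalty (an assumption, not the paper's stated loss).

    Compares how far the prediction moved between the original and the
    mutated input with how far the gold answer moved.
    """
    predicted_shift = cosine_distance(pred_orig, pred_mutant)
    gold_shift = cosine_distance(gold_orig, gold_mutant)
    return abs(predicted_shift - gold_shift)


# Toy vocabulary of three answers; the mutation changes the correct answer.
gold_orig = [1.0, 0.0, 0.0]
gold_mutant = [0.0, 1.0, 0.0]

# A model that ignores the mutation predicts the same distribution twice.
ignoring = [0.9, 0.05, 0.05]
penalty_ignoring = consistency_penalty(ignoring, ignoring, gold_orig, gold_mutant)

# A model that tracks the mutation moves its probability mass accordingly.
tracking_orig = [0.9, 0.05, 0.05]
tracking_mutant = [0.05, 0.9, 0.05]
penalty_tracking = consistency_penalty(
    tracking_orig, tracking_mutant, gold_orig, gold_mutant
)

print(penalty_ignoring, penalty_tracking)
```

In this toy setting the model that ignores the mutation receives a larger penalty than the model whose prediction follows the change in the gold answer, which is the behavior a consistency constraint of this kind is meant to encourage. In practice such a term would presumably be combined with a standard answer-classification loss.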
