Text in scene images conveys critical information for scene understanding and reasoning. In text-based visual question answering (TextVQA), a model must be able both to read this text and to reason over it. However, current TextVQA models do not center on the text and suffer from several limitations: lacking semantic guidance in the answer prediction process, they are easily dominated by language biases and optical character recognition (OCR) errors. In this paper, we propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module (ICSP) and a semantics-centered transformer module (SCT). Equipped with these two modules, the semantics-centered model can resist language biases and the accumulated errors from OCR. Extensive experiments on the TextVQA and ST-VQA datasets demonstrate the effectiveness of our model. SC-Net surpasses previous works by a noticeable margin and is better suited to the TextVQA task.