Text in scene images conveys critical information for scene understanding and reasoning. In text-based visual question answering (TextVQA), a model must be able both to read this text and to reason over it. However, current TextVQA models do not center on the text and suffer from several limitations: lacking semantic guidance in the answer prediction process, they are easily dominated by language biases and optical character recognition (OCR) errors. In this paper, we propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module (ICSP) and a semantics-centered transformer module (SCT). Equipped with these two modules, the semantics-centered model can resist language biases and the accumulated errors from OCR. Extensive experiments on the TextVQA and ST-VQA datasets demonstrate the effectiveness of our model. SC-Net surpasses previous works by a noticeable margin and is better suited to the TextVQA task.