We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information. Accounting for this, we propose a single objective pre-training scheme that requires only text and spatial cues. We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images, despite the domain gap. Scanned documents are easy to procure, text-dense and have a variety of layouts, helping the model learn various spatial cues (e.g. left-of, below etc.) by tying together language and layout information. Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary. We further demonstrate that LaTr improves robustness towards OCR errors, a common reason for failure cases in STVQA. In addition, by leveraging a vision transformer, we eliminate the need for an external object detector. LaTr outperforms state-of-the-art STVQA methods on multiple datasets. In particular, +7.6% on TextVQA, +10.8% on ST-VQA and +4.0% on OCR-VQA (all absolute accuracy numbers).
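To make the idea of "tying together language and layout information" concrete, below is a minimal sketch (not the authors' implementation) of one way OCR token embeddings can be combined with embeddings of their quantized bounding-box coordinates. All module and parameter names (LayoutAwareEmbedding, num_bins, the default sizes) are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch: sum each OCR token embedding with learned embeddings of its
# quantized box corners, so the language model also "sees" spatial layout.
# Names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=32128, d_model=512, num_bins=1000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Separate embeddings for quantized x- and y-coordinates of each box.
        self.x_emb = nn.Embedding(num_bins, d_model)
        self.y_emb = nn.Embedding(num_bins, d_model)

    def forward(self, token_ids, boxes):
        # token_ids: (batch, seq); boxes: (batch, seq, 4) with normalized
        # [x0, y0, x1, y1] in [0, 1), quantized into num_bins buckets.
        n = self.x_emb.num_embeddings
        q = (boxes * n).long().clamp(max=n - 1)
        layout = (self.x_emb(q[..., 0]) + self.y_emb(q[..., 1])
                  + self.x_emb(q[..., 2]) + self.y_emb(q[..., 3]))
        return self.token_emb(token_ids) + layout


# Toy usage: two OCR tokens with their (normalized) bounding boxes.
emb = LayoutAwareEmbedding()
ids = torch.tensor([[11, 42]])
boxes = torch.tensor([[[0.10, 0.20, 0.30, 0.25],
                       [0.35, 0.20, 0.55, 0.25]]])
print(emb(ids, boxes).shape)  # torch.Size([1, 2, 512])
```

Summing token and layout embeddings in this way is one common design choice; it lets a text-only pre-training objective over scanned documents expose the model to spatial relations (left-of, below, etc.) without any visual features.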