End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Handling the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple DETR-like baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries have encoded the requisite text semantics and locations, and can thus be further decoded into the center line, boundary, script, and confidence of the text via very simple prediction heads in parallel. We also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is compatible with line annotations, which require much lower annotation cost than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.
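To make the "explicit point query" idea concrete, below is a minimal sketch in plain PyTorch: a set of learnable point queries attends to encoder features through a single Transformer decoder, and the decoded queries are then fed to simple parallel heads for the center line, boundary, character class (script), and confidence. All module names, head designs, and hyperparameters here (PointQuerySpotter, num_points, num_chars, etc.) are illustrative assumptions, not the authors' implementation; see the official repository for the actual code.

```python
# Hypothetical sketch of point queries decoded by a single decoder with
# parallel prediction heads. Shapes and names are assumptions for illustration.
import torch
import torch.nn as nn


class PointQuerySpotter(nn.Module):
    def __init__(self, num_points=25, d_model=256, num_chars=97, num_layers=6):
        super().__init__()
        # One learnable query per ordered point sampled along a text instance.
        self.point_queries = nn.Embedding(num_points, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # Very simple prediction heads applied to the decoded queries in parallel.
        self.center_head = nn.Linear(d_model, 2)        # (x, y) of each center-line point
        self.boundary_head = nn.Linear(d_model, 4)      # offsets to top/bottom boundary points
        self.char_head = nn.Linear(d_model, num_chars)  # character (script) classification
        self.conf_head = nn.Linear(d_model, 1)          # confidence score

    def forward(self, image_features):
        # image_features: (batch, num_tokens, d_model) from some encoder,
        # e.g. a DETR-style Transformer encoder over backbone features.
        b = image_features.size(0)
        queries = self.point_queries.weight.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(queries, image_features)  # queries attend to image features
        return {
            "center_points": self.center_head(decoded).sigmoid(),
            "boundary_offsets": self.boundary_head(decoded),
            "char_logits": self.char_head(decoded),
            "confidence": self.conf_head(decoded).sigmoid(),
        }


if __name__ == "__main__":
    model = PointQuerySpotter()
    feats = torch.randn(2, 196, 256)  # dummy encoder output
    out = model(feats)
    print({k: v.shape for k, v in out.items()})
```

The point of the sketch is the structure rather than the exact details: detection and recognition share one decoder and one set of point queries, with the sub-tasks separated only at the lightweight parallel heads.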