This paper proposes a new text recognition network for scene-text images. Many state-of-the-art methods employ an attention mechanism in either the text encoder or the decoder for text alignment. Although encoder-based attention yields promising results, these schemes share noticeable limitations: they perform feature extraction (FE) and visual attention (VA) sequentially, which restricts the attention mechanism to the final, single-scale output of the FE stage. Moreover, attention is applied directly to single-scale feature maps only. To address these issues, we propose a new multi-scale, encoder-based attention network for text recognition that performs multi-scale FE and VA in parallel. The multi-scale channels are also fused with one another at regular intervals so that the scales develop a coordinated representation together. Quantitative evaluation and robustness analysis on standard benchmarks demonstrate that the proposed network outperforms the state of the art in most cases.
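The core idea of running visual attention on several scales in parallel and fusing the scales regularly can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's actual architecture: the functions `spatial_attention`, `downsample`, `upsample`, and `multiscale_step` are hypothetical names, and the real network would use learned convolutions and attention weights rather than these fixed operations.

```python
# Minimal sketch of parallel multi-scale attention with cross-scale fusion.
# All layer definitions here are illustrative assumptions, not the paper's.
import numpy as np

def spatial_attention(x):
    """Softmax attention over spatial positions of an (H, W, C) feature map."""
    scores = x.mean(axis=-1)                  # (H, W): pool across channels
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return x * weights[..., None]             # re-weight each spatial position

def downsample(x):
    """2x2 average pooling to produce a coarser scale."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour upsampling back to the finer resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def multiscale_step(fine):
    """One block: attend at two scales in parallel, then fuse them."""
    coarse = downsample(fine)
    fine_att = spatial_attention(fine)        # VA on the fine scale
    coarse_att = spatial_attention(coarse)    # VA on the coarse scale, in parallel
    # Fusion: bring the coarse branch back to the fine resolution and average,
    # so the scales exchange information inside every block, not only at the end.
    return 0.5 * (fine_att + upsample(coarse_att))

feat = np.random.rand(8, 32, 16)              # toy (H, W, C) feature map
out = multiscale_step(feat)
print(out.shape)                              # (8, 32, 16)
```

Stacking several such blocks is what distinguishes this scheme from sequential FE-then-VA pipelines: attention is no longer confined to one final single-scale output.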