The attention-based encoder-decoder framework is widely used in scene text recognition. However, current state-of-the-art (SOTA) methods leave room for improvement in two respects: the efficient use of local visual and global context information from the input text image, and the robustness of the correlation between the scene processing module (encoder) and the text processing module (decoder). In this paper, we propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck. In the encoder module, the local visual feature, global context feature, and position information are aligned and fused to generate a small-size comprehensive feature map. In the decoder module, two methods are used to enhance the correlation between the scene and text feature spaces: 1) the decoder initialization is guided by the holistic feature and global glimpse vector exported from the encoder; 2) the feature-enriched glimpse vector produced by the Multi-Head General Attention is used to assist the RNN iteration and the character prediction at each time step. We also design a Layernorm-Dropout LSTM cell to improve the model's generalization to changeable texts. Extensive experiments on the benchmarks demonstrate the advantageous performance of RCEED in scene text recognition tasks, especially on irregular text.
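To make the idea of a glimpse vector concrete, the following is a minimal sketch of a multi-head attention step with a "general" (bilinear) score. It is not the RCEED implementation. It assumes a Luong-style score of the form q^T W k, and it assumes the feature channels are split evenly across heads. The function names, the dimensions, and the random weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def general_score(query, weight, key):
    """Bilinear 'general' score: query^T @ weight @ key (assumed form)."""
    projected = [sum(w * k for w, k in zip(row, key)) for row in weight]
    return sum(q * p for q, p in zip(query, projected))


def multi_head_general_attention(query, features, weights):
    """Return a glimpse vector and the per-head attention weights.

    query:    decoder hidden state of length d (d divisible by the head count)
    features: list of N encoder feature vectors, each of length d
    weights:  one (d/h x d/h) matrix per head (hypothetical parameters)
    """
    num_heads = len(weights)
    head_dim = len(query) // num_heads
    glimpse, all_alphas = [], []
    for h, weight in enumerate(weights):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q_h = query[lo:hi]
        keys = [f[lo:hi] for f in features]
        # Attention distribution of this head over the N feature positions
        alphas = softmax([general_score(q_h, weight, k) for k in keys])
        # Weighted sum of this head's slice of the feature map
        head_glimpse = [
            sum(a * k[j] for a, k in zip(alphas, keys)) for j in range(head_dim)
        ]
        glimpse.extend(head_glimpse)  # concatenate the heads
        all_alphas.append(alphas)
    return glimpse, all_alphas


# Toy inputs: 5 feature positions, 8 channels, 2 heads (all values are made up)
random.seed(0)
d, num_heads, n_positions = 8, 2, 5
head_dim = d // num_heads
query = [random.gauss(0, 1) for _ in range(d)]
features = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_positions)]
weights = [
    [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(head_dim)]
    for _ in range(num_heads)
]
glimpse, alphas = multi_head_general_attention(query, features, weights)
```

In a decoder of the kind the abstract describes, a vector like `glimpse` would be recomputed at every time step from the current hidden state and then combined with that state for the RNN update and the character prediction. How RCEED performs that combination is not specified in the abstract, so it is left out of this sketch.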