Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps Optical Character Recognition (OCR) tasks thanks to the rich visual and textual information in scene text images. However, these methods cannot cope well with OCR tasks because of the difficulty of both instance-level text encoding and image-text pair acquisition (i.e., images and the texts captured in them). This paper presents a weakly supervised pre-training method that acquires effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction between textual and visual features for learning effective scene text representations. With the learning of textual features, the pre-trained model can attend to texts in images well with character awareness. In addition, these designs enable learning from weakly annotated texts (i.e., partial texts in images without text bounding boxes), which greatly mitigates the data annotation constraint. Experiments on the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves the F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks, respectively. Moreover, the proposed method outperforms existing pre-training techniques consistently across multiple public datasets (e.g., +3.2% and +1.3% on Total-Text and CTW1500).