Self-supervised Character-to-Character Distillation for Text Recognition

When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits recognition performance. Exploring robust text feature representations on unlabeled real images through self-supervised learning is therefore a promising solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis. This limits the flexibility of the augmentations, because large geometric augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. CCD then enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between two augmented views of an image. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, and 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code will be released soon.

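Paraphrase: For complex text images (irregular layouts, low resolution, heavy occlusion, uneven illumination), existing supervised text recognizers need large amounts of data. Large-scale synthetic images reduce the reliance on annotated real images, but the domain gap still caps recognition accuracy. Learning robust text features from unlabeled real images through self-supervision is therefore attractive. Existing self-supervised methods, however, learn sequence-to-sequence representations by coarsely slicing visual features along the horizontal axis, which restricts augmentation: strong geometric augmentations can make the two feature sequences inconsistent. CCD (Character-to-Character Distillation) addresses this by supporting versatile augmentations for more general text representation learning. A self-supervised character segmentation module first outlines the character structures in unlabeled real images. CCD then uses the transformation matrix between the two augmented views to diversify local character appearance while keeping the characters aligned pairwise. In experiments, CCD reaches state-of-the-art results on three tasks, with average gains of 1.38% in text recognition, 1.7% in text segmentation, and 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. The code is to be released soon.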
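
The alignment idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes each augmented view is produced by a known affine matrix, and that each character is summarized by a single centroid (the paper works with segmented character regions). The helper names `affine`, `apply`, and `view_to_view`, and all coordinates, are hypothetical. The sketch shows that the view-to-view matrix keeps characters paired by index even when a strong geometric augmentation reverses their left-to-right order, which is the case where horizontal slicing would pair the wrong features.

```python
import numpy as np

def affine(angle_deg=0.0, scale=1.0, tx=0.0, ty=0.0):
    """Build a 3x3 homogeneous matrix: rotate and scale, then translate."""
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t), 0.0],
                    [np.sin(t),  np.cos(t), 0.0],
                    [0.0, 0.0, 1.0]])
    rot[:2, :2] *= scale
    trans = np.array([[1.0, 0.0, tx],
                      [0.0, 1.0, ty],
                      [0.0, 0.0, 1.0]])
    return trans @ rot

def apply(matrix, points):
    """Apply a 3x3 homogeneous transform to an (N, 2) array of xy points."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    out = homo @ matrix.T
    return out[:, :2] / out[:, 2:3]

def view_to_view(a1, a2):
    """Matrix mapping view-1 coordinates to view-2 coordinates."""
    return a2 @ np.linalg.inv(a1)

# Hypothetical centroids (x, y) of three characters in the original image.
chars = np.array([[10.0, 16.0], [30.0, 16.0], [50.0, 16.0]])

a1 = affine(angle_deg=15, scale=1.2, tx=3, ty=-2)  # view 1: mild rotation
a2 = affine(angle_deg=180, tx=64, ty=32)           # view 2: upside down

chars_v1 = apply(a1, chars)
chars_v2 = apply(a2, chars)

# Map view-1 centroids into view 2 using only the two augmentation matrices.
mapped = apply(view_to_view(a1, a2), chars_v1)
aligned = np.allclose(mapped, chars_v2)

# Left-to-right order of the characters in each view.
order_v1 = np.argsort(chars_v1[:, 0])
order_v2 = np.argsort(chars_v2[:, 0])
print(aligned, order_v1.tolist(), order_v2.tolist())
```

With these example matrices, the mapped centroids coincide with the view-2 centroids, while the horizontal order in view 2 is the reverse of view 1. A method that pairs the k-th horizontal slice of one view with the k-th slice of the other would therefore compare different characters, whereas pairing through the transformation matrix does not.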