Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks. Self-supervision is essential here, since no supervised counterpart exists due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results illustrate that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g., document image classification (91.11 $\rightarrow$ 92.69), document layout analysis (91.0 $\rightarrow$ 94.9), table detection (94.23 $\rightarrow$ 96.55), and text detection for OCR (93.07 $\rightarrow$ 94.29). The code and pre-trained models are publicly available at \url{https://aka.ms/msdit}.