In recent years, (retro-)digitizing paper-based files has become a major undertaking for private and public archives, as well as an important task in electronic mailroom applications. As a first step, the workflow involves scanning documents and applying Optical Character Recognition (OCR). Because pages are scanned one at a time, preserving the document context of each single-page scan is a major requirement. Page stream segmentation (PSS) is the task of automatically separating a stream of scanned images into multi-page documents, which facilitates workflows involving very large numbers of paper scans. In a digitization project together with a German federal archive, we developed a novel approach based on convolutional neural networks (CNNs) that combines image and text features to achieve optimal document separation results. Evaluation shows that our PSS architecture achieves an accuracy of up to 93%, which can be regarded as a new state of the art for this task.
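To make the task definition concrete, the following is a minimal sketch of how a page stream could be split once a decision is available for every page. It assumes that PSS is framed as a per-page binary decision (does this page start a new document?) and that accuracy is measured per page; the abstract states neither explicitly. The names `segment_page_stream` and `boundary_accuracy` are hypothetical, and the CNN that would produce the decisions is not shown.

```python
from typing import List, Sequence


def segment_page_stream(is_first_page: Sequence[bool]) -> List[List[int]]:
    """Group page indices into documents from per-page boundary decisions.

    is_first_page[i] is True when page i is judged to start a new document.
    Assumption: the first page of the stream always opens a document,
    whatever its flag says.
    """
    documents: List[List[int]] = []
    for index, starts_new in enumerate(is_first_page):
        if starts_new or not documents:
            # Open a new document beginning at this page.
            documents.append([index])
        else:
            # Continuation page: attach it to the current document.
            documents[-1].append(index)
    return documents


def boundary_accuracy(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Fraction of pages whose boundary decision matches the gold label.

    Assumption: accuracy is counted per page; the abstract does not say
    how its reported accuracy is defined.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty stream")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


if __name__ == "__main__":
    # Hypothetical decisions for a stream of six scanned pages.
    flags = [True, False, False, True, True, False]
    print(segment_page_stream(flags))  # [[0, 1, 2], [3], [4, 5]]
```

In this framing, a single wrong boundary decision either merges two documents or splits one, so per-page accuracy only approximates the quality of the resulting documents.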