Optical Character Recognition (OCR), the task of extracting textual information from scanned documents, is a vital and widely used technology for digitizing and indexing physical documents. Existing technologies perform well on clean documents, but when a document is visually degraded or contains non-textual elements, OCR quality can drop sharply, in particular because of erroneous detections. In this paper we present an improved detection network with a masking system to improve the quality of OCR performed on documents. By filtering non-textual elements from the image, we can apply document-level OCR, which incorporates contextual information, to improve OCR results. We perform a unified evaluation on a publicly available dataset, demonstrating the usefulness and broad applicability of our method. Additionally, we present and make publicly available our synthetic dataset with a unique hard-negative component specifically tuned to improve detection results, and we evaluate the benefits gained from its use.
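As a rough illustration of the masking idea, the sketch below blanks every pixel that falls outside a set of detected text boxes, so that a document-level OCR engine would see only textual regions. It is a minimal sketch under stated assumptions, not the method described here: the function name `mask_non_text`, the `(x0, y0, x1, y1)` box format, the white fill value, and the grayscale list-of-rows image representation are all hypothetical, and the boxes are assumed to come from some text detector that is not shown.

```python
def mask_non_text(image, text_boxes, fill=255):
    """Return a copy of `image` with pixels outside `text_boxes` set to `fill`.

    image: grayscale page as a list of rows of pixel values (assumed format).
    text_boxes: iterable of (x0, y0, x1, y1) pixel boxes, with x1 and y1
        exclusive (hypothetical detector output format).
    """
    height = len(image)
    width = len(image[0]) if height else 0

    # Start from a blank page, then copy back only the detected text regions.
    masked = [[fill] * width for _ in range(height)]
    for x0, y0, x1, y1 in text_boxes:
        # Clip each box to the page so out-of-range detections are harmless.
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        if x0 >= x1 or y0 >= y1:
            continue  # box lies entirely outside the page or is empty
        for y in range(y0, y1):
            masked[y][x0:x1] = image[y][x0:x1]
    return masked


if __name__ == "__main__":
    page = [[0] * 4 for _ in range(3)]        # a tiny all-black "page"
    boxes = [(1, 0, 3, 2)]                    # one assumed text detection
    for row in mask_non_text(page, boxes):
        print(row)
```

The masked page could then be passed as a whole to any OCR engine; which engine is used, and how detections are produced, are left open here.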