Detection Masking for Improved OCR on Noisy Documents

Optical Character Recognition (OCR), the task of extracting textual information from scanned documents, is a vital and widely used technology for digitizing and indexing physical documents. Existing technologies perform well on clean documents, but when a document is visually degraded or contains non-textual elements, OCR quality can suffer greatly, specifically because of erroneous detections. In this paper, we present an improved detection network with a masking system to improve the quality of OCR performed on documents. By filtering non-textual elements from the image, we can use document-level OCR, which incorporates contextual information, to improve OCR results. We perform a unified evaluation on a publicly available dataset, demonstrating the usefulness and broad applicability of our method. Additionally, we present and make publicly available our synthetic dataset, which has a unique hard-negative component specifically tuned to improve detection results, and we evaluate the benefits gained from its use.
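The masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the detector returns axis-aligned text boxes as `(x0, y0, x1, y1)` pixel coordinates with exclusive right and bottom edges, that the page is a grayscale image stored as a list of rows, and that white (`255`) is a suitable background fill. The function name `mask_non_text` is hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x0, y0, x1, y1), with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def mask_non_text(
    image: Sequence[Sequence[int]],
    text_boxes: Sequence[Box],
    fill: int = 255,
) -> List[List[int]]:
    """Return a copy of `image` in which every pixel outside the
    detected text boxes is replaced by `fill` (assumed background)."""
    height = len(image)
    width = len(image[0]) if height else 0

    # Start from a blank page, then copy back only the text regions.
    masked = [[fill] * width for _ in range(height)]

    for x0, y0, x1, y1 in text_boxes:
        # Clip the box to the image bounds.
        x0, y0 = max(0, x0), max(0, y0)
        x1, y1 = min(width, x1), min(height, y1)
        if x1 <= x0 or y1 <= y0:
            continue  # empty or fully out-of-bounds box
        for y in range(y0, y1):
            masked[y][x0:x1] = image[y][x0:x1]

    return masked
```

The masked page, rather than individual crops, would then be passed to a document-level OCR engine, so that the engine still sees the full line and page layout while non-textual regions are blanked out.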



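OCR (Optical Character Recognition) is the process by which an electronic device, such as a scanner or digital camera, examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text. In other words, for printed characters, it optically converts the text in a paper document into a black-and-white bitmap image file, and recognition software then converts the text in that image into a text format that word-processing software can further edit.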
