Translating document renderings (e.g., PDFs, scans) into hierarchical document structures is in high demand across many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed "DocParser": an end-to-end system for parsing the complete document structure, including all text elements, nested figures, tables, and table cell structures. Our second contribution is a dataset for evaluating hierarchical document structure parsing. Our third contribution is a scalable learning framework for settings where domain-specific data are scarce, which we address with a novel approach to weak supervision that significantly improves document structure parsing performance. Our experiments confirm the effectiveness of the proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1% and the F1 score for classifying hierarchical relations by 35.8%.
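To make the notions of a hierarchical document structure and an F1 score over hierarchical relations concrete, the following is a minimal sketch and not the DocParser implementation. The `Entity` class, the `parent_of` relation label, and the exact-match scoring over (parent, child, relation) triples are all assumptions introduced for illustration; the actual system and evaluation protocol may differ.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Triple = Tuple[int, int, str]


@dataclass
class Entity:
    """One document entity (hypothetical structure, for illustration only)."""
    entity_id: int
    category: str  # e.g. "document", "table", "table_cell", "figure"
    children: List["Entity"] = field(default_factory=list)


def relation_triples(root: Entity) -> Set[Triple]:
    """Flatten a tree into (parent_id, child_id, relation) triples."""
    triples: Set[Triple] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            triples.add((node.entity_id, child.entity_id, "parent_of"))
            stack.append(child)
    return triples


def relation_f1(predicted: Set[Triple], gold: Set[Triple]) -> float:
    """F1 over exact-match relation triples (an assumed matching rule)."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Gold structure: a document containing a table with two cells, plus a figure.
gold_tree = Entity(0, "document", [
    Entity(1, "table", [Entity(2, "table_cell"), Entity(3, "table_cell")]),
    Entity(4, "figure"),
])

# Predicted structure: cell 3 is wrongly attached to the document root.
predicted_tree = Entity(0, "document", [
    Entity(1, "table", [Entity(2, "table_cell")]),
    Entity(3, "table_cell"),
    Entity(4, "figure"),
])

gold = relation_triples(gold_tree)
predicted = relation_triples(predicted_tree)
score = relation_f1(predicted, gold)
print(f"relation F1: {score:.2f}")
```

In this toy case three of the four predicted relations match the gold tree, so precision and recall are both 0.75. Representing relations as a set of triples keeps the metric independent of the order in which children are listed.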