Automatic table detection in PDF documents has seen considerable success, but tabular data extraction remains challenging because detected table areas are often incomplete or noisy. Accurate data extraction is especially crucial in the finance domain. Motivated by this, we aim to automate table detection and tabular data extraction from financial PDF documents. We propose a method that consists of three main processes: detecting table areas on each page image with a Faster R-CNN (Region-based Convolutional Neural Network) model with a Feature Pyramid Network (FPN); extracting contents and structures with a compound layout segmentation technique based on optical character recognition (OCR); and formulating regular expression rules for table header separation. The tabular data extraction feature includes rule-based filtering and restructuring functions that are highly scalable. For the experiments, we annotate a new Financial Documents dataset with table regions. The detection model achieves excellent table detection performance on our customized dataset. The main contributions of this paper are the Financial Documents dataset with table-area annotations, a strong table detection model, and the rule-based layout segmentation technique for tabular data extraction from PDF files.
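The abstract mentions regular expression rules for table header separation but does not list them. As a rough illustration only, the sketch below shows one way such a rule could work on OCR output that has already been split into rows of cell strings. The patterns `NUMERIC_CELL` and `YEAR_CELL`, the helper names `is_data_row` and `split_header`, and the "at least half of the value cells look numeric" threshold are all assumptions made for this example, not the rules used in the paper.

```python
import re

# Hypothetical pattern: a cell "looks numeric" if it resembles a financial
# amount such as "1,234.50", "(300)", "-12.5%", or a lone "-" placeholder.
NUMERIC_CELL = re.compile(r"^\(?[-+]?[$€£]?\d[\d,]*(\.\d+)?%?\)?$|^-$")

# Hypothetical pattern: four-digit years are treated as header labels,
# since financial tables often use reporting years as column headers.
YEAR_CELL = re.compile(r"^(19|20)\d{2}$")


def is_data_row(cells):
    """Return True when at least half of the non-empty cells after the
    first column look like financial amounts (and are not bare years)."""
    values = [c.strip() for c in cells[1:] if c.strip()]
    if not values:
        return False
    numeric = sum(
        1 for v in values
        if NUMERIC_CELL.match(v) and not YEAR_CELL.match(v)
    )
    return numeric * 2 >= len(values)


def split_header(rows):
    """Split rows into (header_rows, body_rows) at the first data row.
    If no data row is found, every row is returned as header."""
    for i, row in enumerate(rows):
        if is_data_row(row):
            return rows[:i], rows[i:]
    return rows, []


if __name__ == "__main__":
    # Made-up example rows, standing in for OCR output of one table.
    rows = [
        ["", "2019", "2020"],
        ["Item", "USD '000", "USD '000"],
        ["Revenue", "1,234.5", "1,310.0"],
        ["Net loss", "(300)", "-"],
    ]
    header, body = split_header(rows)
    print("header rows:", header)
    print("body rows:", body)
```

A rule of this kind is easy to extend with further patterns (for example, additional currency symbols or unit labels), which is one plausible reading of the scalability the abstract attributes to rule-based filtering.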