用于从扫描财务文件图像中检测表格和抽取表格数据的深结构特征网络 (Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images) - 专知论文

会员服务 ·

0

SCAN · Networking · Automator · MoDELS · OCR（光学字符识别） ·

2021 年 2 月 20 日

Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images

翻译：用于从扫描财务文件图像中检测表格和抽取表格数据的深结构特征网络

Siwen Luo,Mengting Wu,Yiwen Gong,Wanying Zhou,Josiah Poon

Automatic table detection in PDF documents has achieved a great success but tabular data extraction are still challenging due to the integrity and noise issues in detected table areas. The accurate data extraction is extremely crucial in finance area. Inspired by this, the aim of this research is proposing an automated table detection and tabular data extraction from financial PDF documents. We proposed a method that consists of three main processes, which are detecting table areas with a Faster R-CNN (Region-based Convolutional Neural Network) model with Feature Pyramid Network (FPN) on each page image, extracting contents and structures by a compounded layout segmentation technique based on optical character recognition (OCR) and formulating regular expression rules for table header separation. The tabular data extraction feature is embedded with rule-based filtering and restructuring functions that are highly scalable. We annotate a new Financial Documents dataset with table regions for the experiment. The excellent table detection performance of the detection model is obtained from our customized dataset. The main contributions of this paper are proposing the Financial Documents dataset with table-area annotations, the superior detection model and the rule-based layout segmentation technique for the tabular data extraction from PDF files.

翻译：PDF文件的自动表格检测取得了巨大成功,但由于所探测的表格区域的完整性和噪音问题,表格数据提取仍然具有挑战性。准确的数据提取在财务领域极为关键。受此启发,本研究的目的是提出自动表格检测和从金融PDF文件抽取表格数据。我们提出了由三个主要流程组成的方法,即检测表格区域,以快速 R-CNN (基于区域的革命神经网络)为模型,每个页面图像都带有地貌金字网(FPN),通过基于光学字符识别(OCR)的复合版面分割技术提取内容和结构,并为表格页头分离制定常规表达规则。表格数据提取功能与基于规则的过滤和结构重组功能嵌入了高度可缩放的功能。我们注意到与用于实验的表格区域的新的财务文件数据集。检测模型的出色表格检测性能是从我们定制的数据集中得来的。本文的主要贡献是提出带有表域说明的金融文档数据集、高级检测模型和基于规则的布局分割技术,用于从PDFF文件中提取表格数据。

1

相关内容

SCAN

【ICML2020】用于图结构化数据的卷积核网络，Convolutional Kernel Networks for Graph-Structured Data

【ICML2020】用于图结构化数据的卷积核网络，Convolutional Kernel Networks for Graph-Structured Data

专知会员服务

44+阅读 · 2020年6月29日

商业数据分析，39页ppt

商业数据分析，39页ppt

专知会员服务

165+阅读 · 2020年6月2日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

166+阅读 · 2020年3月18日

【深度学习架构、模型和技巧集合(TensorFlow/PyTorch)】’Deep Learning Models - A collection of various deep learning architectures, models, and tips'

【深度学习架构、模型和技巧集合(TensorFlow/PyTorch)】’Deep Learning Models - A collection of various deep learning architectures, models, and tips'

专知会员服务

59+阅读 · 2020年1月25日

【深度学习表格检测、信息提取和结构化】《Table Detection, Information Extraction and Structuring using Deep Learning》by Vihar Kurama

专知会员服务

38+阅读 · 2020年1月23日

【新书】深度学习搜索，Deep Learning for Search，附327页pdf

【新书】深度学习搜索，Deep Learning for Search，附327页pdf

专知会员服务

213+阅读 · 2020年1月13日

面向机器学习和数据分析的特征工程（Feature Engineering for Machine Learning and Data Analytics），附新书419页pdf

面向机器学习和数据分析的特征工程（Feature Engineering for Machine Learning and Data Analytics），附新书419页pdf

专知会员服务

62+阅读 · 2019年10月26日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

计算机视觉最佳实践、代码示例和相关文档

计算机视觉最佳实践、代码示例和相关文档

专知会员服务

20+阅读 · 2019年10月9日

【ECML-PKDD 2019】从AIS数据中发现隐藏的概念:一种用于异常检测的海上交通网络抽象（Uncovering hidden concepts from AIS data: A network abstraction of maritime traffic for anomaly detection）

【ECML-PKDD 2019】从AIS数据中发现隐藏的概念:一种用于异常检测的海上交通网络抽象（Uncovering hidden concepts from AIS data: A network abstraction of maritime traffic for anomaly detection）

专知会员服务

22+阅读 · 2019年9月16日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知

133+阅读 · 2020年3月18日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

已删除

将门创投

4+阅读 · 2017年12月12日

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

机器学习研究会

11+阅读 · 2017年12月5日

【推荐】YOLO实时目标检测(6fps)

【推荐】YOLO实时目标检测(6fps)

机器学习研究会

20+阅读 · 2017年11月5日

【ICCV 2017论文集】计算机视觉顶级会议ICCV2017 Open Access Repository

【ICCV 2017论文集】计算机视觉顶级会议ICCV2017 Open Access Repository

专知

6+阅读 · 2017年10月14日

【推荐】深度学习目标检测概览

【推荐】深度学习目标检测概览

机器学习研究会

10+阅读 · 2017年9月1日

Acoustic Structure Inverse Design and Optimization Using Deep Learning

Acoustic Structure Inverse Design and Optimization Using Deep Learning

Arxiv

0+阅读 · 2021年4月15日

Reverse Attention for Salient Object Detection

Arxiv

11+阅读 · 2019年4月15日

Deep Network Embedding for Graph Representation Learning in Signed Networks

Arxiv

4+阅读 · 2019年1月7日

Tiny-DSOD: Lightweight Object Detection for Resource-Restricted Usages

Tiny-DSOD: Lightweight Object Detection for Resource-Restricted Usages

Arxiv

5+阅读 · 2018年7月29日

Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks

Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks

Arxiv

3+阅读 · 2018年6月26日

A Convolutional Feature Map based Deep Network targeted towards Traffic Detection and Classification

Arxiv

3+阅读 · 2018年5月22日

Contrast-Oriented Deep Neural Networks for Salient Object Detection

Arxiv

6+阅读 · 2018年3月30日

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

Arxiv

8+阅读 · 2018年2月21日

DOTA: A Large-scale Dataset for Object Detection in Aerial Images

Arxiv

19+阅读 · 2018年1月27日

MSDNN: Multi-Scale Deep Neural Network for Salient Object Detection

Arxiv

21+阅读 · 2018年1月12日

VIP会员

文章信息

相关主题

OCR（光学字符识别）

相关VIP内容

【ICML2020】用于图结构化数据的卷积核网络，Convolutional Kernel Networks for Graph-Structured Data

【ICML2020】用于图结构化数据的卷积核网络，Convolutional Kernel Networks for Graph-Structured Data

专知会员服务

44+阅读 · 2020年6月29日

商业数据分析，39页ppt

商业数据分析，39页ppt

专知会员服务

165+阅读 · 2020年6月2日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

166+阅读 · 2020年3月18日

【深度学习架构、模型和技巧集合(TensorFlow/PyTorch)】’Deep Learning Models - A collection of various deep learning architectures, models, and tips'

【深度学习架构、模型和技巧集合(TensorFlow/PyTorch)】’Deep Learning Models - A collection of various deep learning architectures, models, and tips'

专知会员服务

59+阅读 · 2020年1月25日

【深度学习表格检测、信息提取和结构化】《Table Detection, Information Extraction and Structuring using Deep Learning》by Vihar Kurama

专知会员服务

38+阅读 · 2020年1月23日

【新书】深度学习搜索，Deep Learning for Search，附327页pdf

【新书】深度学习搜索，Deep Learning for Search，附327页pdf

专知会员服务

213+阅读 · 2020年1月13日

面向机器学习和数据分析的特征工程（Feature Engineering for Machine Learning and Data Analytics），附新书419页pdf

面向机器学习和数据分析的特征工程（Feature Engineering for Machine Learning and Data Analytics），附新书419页pdf

专知会员服务

62+阅读 · 2019年10月26日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

计算机视觉最佳实践、代码示例和相关文档

计算机视觉最佳实践、代码示例和相关文档

专知会员服务

20+阅读 · 2019年10月9日

【ECML-PKDD 2019】从AIS数据中发现隐藏的概念:一种用于异常检测的海上交通网络抽象（Uncovering hidden concepts from AIS data: A network abstraction of maritime traffic for anomaly detection）

【ECML-PKDD 2019】从AIS数据中发现隐藏的概念:一种用于异常检测的海上交通网络抽象（Uncovering hidden concepts from AIS data: A network abstraction of maritime traffic for anomaly detection）

专知会员服务

22+阅读 · 2019年9月16日

热门VIP内容

开通专知VIP会员享更多权益服务

《超视距空战强化学习智能体的深度学习表征能力评估》最新70页

《第一人称视角无人机革命及其对陆战与其它战争维度的影响》最新19页报告

从兵棋推演到真实战场：人工智能指挥官在实战中的崛起

《小型无人机系统空域管理与控制：美陆军指挥官手册》最新34页

相关资讯

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知

133+阅读 · 2020年3月18日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

已删除

将门创投

4+阅读 · 2017年12月12日

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

机器学习研究会

11+阅读 · 2017年12月5日

【推荐】YOLO实时目标检测(6fps)

【推荐】YOLO实时目标检测(6fps)

机器学习研究会

20+阅读 · 2017年11月5日

【ICCV 2017论文集】计算机视觉顶级会议ICCV2017 Open Access Repository

【ICCV 2017论文集】计算机视觉顶级会议ICCV2017 Open Access Repository

专知

6+阅读 · 2017年10月14日

【推荐】深度学习目标检测概览

【推荐】深度学习目标检测概览

机器学习研究会

10+阅读 · 2017年9月1日

相关论文

Acoustic Structure Inverse Design and Optimization Using Deep Learning

Acoustic Structure Inverse Design and Optimization Using Deep Learning

Arxiv

0+阅读 · 2021年4月15日

Reverse Attention for Salient Object Detection

Arxiv

11+阅读 · 2019年4月15日

Deep Network Embedding for Graph Representation Learning in Signed Networks

Arxiv

4+阅读 · 2019年1月7日

Tiny-DSOD: Lightweight Object Detection for Resource-Restricted Usages

Tiny-DSOD: Lightweight Object Detection for Resource-Restricted Usages

Arxiv

5+阅读 · 2018年7月29日

Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks

Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks

Arxiv

3+阅读 · 2018年6月26日

A Convolutional Feature Map based Deep Network targeted towards Traffic Detection and Classification

Arxiv

3+阅读 · 2018年5月22日

Contrast-Oriented Deep Neural Networks for Salient Object Detection

Arxiv

6+阅读 · 2018年3月30日

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

Arxiv

8+阅读 · 2018年2月21日

DOTA: A Large-scale Dataset for Object Detection in Aerial Images

Arxiv

19+阅读 · 2018年1月27日

MSDNN: Multi-Scale Deep Neural Network for Salient Object Detection

Arxiv

21+阅读 · 2018年1月12日

微信扫码咨询专知VIP会员