Most low-resource languages lack the resources needed to build even a substantial monolingual corpus. Text in these languages often appears in government proceedings, but mostly as Portable Document Format (PDF) files that use legacy fonts. Extracting text from such documents to build a monolingual corpus is challenging because legacy fonts and printer-friendly encodings are not optimized for text extraction. We therefore propose a simple, automatic, and novel approach that scales across the Tamil, Sinhala, and English languages and across large document collections. To this end, we enhance Tesseract 4.1.1 by applying LSTM-based training on many legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters in printed documents. We show that this approach boosts the character-level accuracy of Tesseract 4.1.1 from 85.5% to 98.2% for Tamil (+14.9% relative change) and from 91.8% to 94.8% for Sinhala (+3.26% relative change) on a dataset its authors consider challenging.
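To make the inference side of this pipeline concrete, the sketch below shows one plausible way to run Tesseract's LSTM engine on a rendered PDF page using the pytesseract wrapper. It is a minimal illustration only: the model name tam_legacy and the ./tessdata directory are assumptions standing in for a traineddata file produced by the legacy-font training described above, not artifacts shipped with the paper.

```python
# Minimal sketch: OCR a rendered PDF page with a fine-tuned Tesseract model.
# 'tam_legacy.traineddata' is a hypothetical name for an LSTM model retrained
# on legacy Tamil fonts and placed in ./tessdata (assumed layout).
from PIL import Image
import pytesseract

def ocr_page(image_path: str, lang: str = "tam_legacy+sin+eng") -> str:
    """OCR one page image; '+' chains languages so code-mixed
    Tamil/Sinhala/English text, numbers, and punctuation are covered."""
    page = Image.open(image_path)
    # --oem 1 selects the LSTM engine; --tessdata-dir points Tesseract at
    # the directory holding the retrained .traineddata files.
    config = "--oem 1 --psm 3 --tessdata-dir ./tessdata"
    return pytesseract.image_to_string(page, lang=lang, config=config)

if __name__ == "__main__":
    # e.g. a government-proceedings page exported from a PDF at 300 dpi
    print(ocr_page("page_001.png"))
```

Chaining models with '+' rather than training one combined model is a design choice Tesseract supports natively; either route would fit the multilingual, code-mixed setting the abstract describes.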