
March 28, 2023

Scalable handwritten text recognition system for lexicographic sources of under-resourced languages and alphabets


Jan Idziak,Artjoms Šeļa,Michał Woźniak,Albert Leśniak,Joanna Byszuk,Maciej Eder

The paper discusses an approach to deciphering large collections of handwritten index cards of historical dictionaries. Our study provides a working solution that reads the cards and links their lemmas to a searchable list of dictionary entries, for a large historical dictionary entitled the Dictionary of the 17th- and 18th-century Polish, which comprises 2.8 million index cards. We apply a tailored handwritten text recognition (HTR) solution that involves (1) an optimized detection model; (2) a recognition model to decipher the handwritten content, designed as a spatial transformer network (STN) followed by a convolutional neural network (RCNN) with a connectionist temporal classification (CTC) layer, trained using a synthetic set of 500,000 generated Polish words of different lengths; (3) a post-processing step using constrained Word Beam Search (WBC), in which the predictions are matched against a list of dictionary entries known in advance. Our model achieved a word-level accuracy of 0.881, which outperforms the base RCNN model. Within this study we produced a set of 20,000 manually annotated index cards that can be used for future benchmarks and transfer learning HTR applications.
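To make steps (2) and (3) more concrete, here is a minimal sketch of how CTC output can be collapsed into a word and then matched against a known list of entries. It is not the authors' implementation and it is not Word Beam Search: it swaps in a much simpler stand-in (best-path CTC decoding followed by a nearest-entry lookup using Levenshtein distance). All names (`ctc_greedy_decode`, `edit_distance`, `match_to_lexicon`), the blank index, and the toy lexicon are assumptions made for illustration.

```python
from typing import Sequence, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_labels: Sequence[int], alphabet: str) -> str:
    """Best-path CTC decoding: collapse repeated labels, then drop blanks.

    frame_labels holds the most likely label per time step; label k > 0
    maps to alphabet[k - 1] (an assumed convention for this sketch).
    """
    chars = []
    previous = BLANK
    for label in frame_labels:
        if label != previous and label != BLANK:
            chars.append(alphabet[label - 1])
        previous = label
    return "".join(chars)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a single rolling row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        diagonal, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            substitution = diagonal + (ca != cb)
            diagonal = row[j]  # keep the old value for the next column
            row[j] = min(row[j] + 1, row[j - 1] + 1, substitution)
    return row[len(b)]


def match_to_lexicon(prediction: str, lexicon: Sequence[str]) -> Tuple[str, int]:
    """Return the closest lexicon entry and its distance to the prediction.

    Ties are broken alphabetically so that the result is deterministic.
    """
    best = min(lexicon, key=lambda entry: (edit_distance(prediction, entry), entry))
    return best, edit_distance(prediction, best)


if __name__ == "__main__":
    # Hypothetical toy lexicon standing in for the list of dictionary entries.
    lexicon = ["chleb", "woda", "zamek"]
    print(match_to_lexicon("zanek", lexicon))
```

A real Word Beam Search constrains the beam during decoding with a prefix tree built from the lexicon, so it can recover words that best-path decoding followed by a lookup would miss. The sketch only shows the general idea of restricting predictions to entries known in advance.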

