We investigate the performance of current OCR systems on low-resource languages and low-resource scripts. We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise, for 60 low-resource languages in low-resource scripts. We evaluate state-of-the-art OCR systems on our benchmark and analyse the most common errors. We show that monolingual data obtained through OCR is a valuable resource that can increase the performance of Machine Translation models when used in backtranslation. We then perform an ablation study to investigate how OCR errors impact Machine Translation performance and to determine the minimum level of OCR quality needed for the monolingual data to be useful for Machine Translation.