Most low-resource languages lack the resources needed to build even a substantial monolingual corpus. Text in these languages often appears in government proceedings, but mostly as Portable Document Format (PDF) files that use legacy fonts. Extracting text from such documents to build a monolingual corpus is challenging because legacy fonts and printer-friendly encodings are not optimized for text extraction. We therefore propose a simple, automatic, and novel approach that scales across the Tamil, Sinhala, and English languages and across large document collections. To this end, we enhance Tesseract 4.1.1 by applying LSTM-based training on many legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters in printed documents. We show that this approach boosts the character-level accuracy of Tesseract 4.1.1 from 85.5% to 98.2% for Tamil (+14.9% relative change) and from 91.8% to 94.8% for Sinhala (+3.26% relative change) on a dataset its authors consider challenging.
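To make the inference side of this pipeline concrete, the sketch below shows one plausible way to run Tesseract's LSTM engine on a rendered PDF page using the pytesseract wrapper. It is a minimal illustration only: the model name tam_legacy and the ./tessdata directory are assumptions standing in for a traineddata file produced by the legacy-font training described above, not artifacts shipped with the paper.

```python
# Minimal sketch: OCR a rendered PDF page with a fine-tuned Tesseract model.
# 'tam_legacy.traineddata' is a hypothetical name for an LSTM model retrained
# on legacy Tamil fonts and placed in ./tessdata (assumed layout).
from PIL import Image
import pytesseract

def ocr_page(image_path: str, lang: str = "tam_legacy+sin+eng") -> str:
    """OCR one page image; '+' chains languages so code-mixed
    Tamil/Sinhala/English text, numbers, and punctuation are covered."""
    page = Image.open(image_path)
    # --oem 1 selects the LSTM engine; --tessdata-dir points Tesseract at
    # the directory holding the retrained .traineddata files.
    config = "--oem 1 --psm 3 --tessdata-dir ./tessdata"
    return pytesseract.image_to_string(page, lang=lang, config=config)

if __name__ == "__main__":
    # e.g. a government-proceedings page exported from a PDF at 300 dpi
    print(ocr_page("page_001.png"))
```

Chaining models with '+' rather than training one combined model is a design choice Tesseract supports natively; either route would fit the multilingual, code-mixed setting the abstract describes.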