A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English

We study the training of a single end-to-end (E2E) automatic speech recognition (ASR) model for three languages used in Kazakhstan: Kazakh, Russian, and English. We first describe the development of a multilingual E2E ASR system based on Transformer networks and then assess it extensively on these three languages. We also compare two ways of constructing the output grapheme set: combined and independent. Furthermore, we evaluate the impact of language models (LMs) and data augmentation techniques on the recognition performance of the multilingual E2E ASR system. In addition, we present several datasets for training and evaluation. Experimental results show that the multilingual models achieve performance comparable to that of monolingual baselines with a similar number of parameters. Our best monolingual and multilingual models achieved average word error rates of 20.9% and 20.5%, respectively, on the combined test set. To ensure the reproducibility of our experiments and results, we share our training recipes, datasets, and pre-trained models.
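
The abstract names two output grapheme set variants without defining them, so the following is only a minimal sketch under an assumed reading: "combined" takes the union of the graphemes of all languages, so that letters shared by Kazakh and Russian map to one output unit, while "independent" gives every language its own copy of each grapheme by attaching a language tag. The toy transcripts, the tag format (`_kk`, `_ru`, `_en`), and the function names are hypothetical and are not taken from the authors' recipes.

```python
from typing import Dict, List, Set

# Toy transcripts for illustration only (hypothetical, not from the paper's datasets).
CORPUS: Dict[str, List[str]] = {
    "kk": ["сәлем әлем"],
    "ru": ["привет мир"],
    "en": ["hello world"],
}


def graphemes(lines: List[str]) -> Set[str]:
    """Collect the distinct non-whitespace characters in a list of transcripts."""
    return {ch for line in lines for ch in line if not ch.isspace()}


def combined_set(corpus: Dict[str, List[str]]) -> List[str]:
    """Assumed 'combined' variant: one shared inventory, the union over languages."""
    units: Set[str] = set()
    for lines in corpus.values():
        units |= graphemes(lines)
    return sorted(units)


def independent_set(corpus: Dict[str, List[str]]) -> List[str]:
    """Assumed 'independent' variant: each grapheme is tagged with its language,
    so a letter used by two languages yields two separate output units."""
    return sorted(
        f"{ch}_{lang}"
        for lang, lines in corpus.items()
        for ch in graphemes(lines)
    )


if __name__ == "__main__":
    combined = combined_set(CORPUS)
    independent = independent_set(CORPUS)
    print(len(combined), "combined units")
    print(len(independent), "independent units")
```

Under this reading, the independent inventory is larger than the combined one by exactly the number of graphemes that appear in more than one language, which is the trade-off between parameter sharing and language separation that such a comparison would probe.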
