Recent methods in speech and language technology pretrain very large models that are then fine-tuned for specific tasks. However, the benefits of such large models are often limited to a few resource-rich languages of the world. In this work, we make multiple contributions towards building ASR systems for low-resource languages from the Indian subcontinent. First, we curate 17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains, including education, news, technology, and finance. Second, using this raw speech data, we pretrain several variants of wav2vec-style models for 40 Indian languages. Third, we analyze the pretrained models and find key properties: codebook vectors of similar-sounding phonemes are shared across languages, representations across layers are discriminative of the language family, and attention heads often attend within small local windows. Fourth, we fine-tune the pretrained model for downstream ASR in 9 languages and obtain state-of-the-art results on 3 public datasets, including on very low-resource languages such as Sinhala and Nepali. Our work establishes that multilingual pretraining is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent.
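To make the fourth step concrete, the sketch below shows how a fine-tuned wav2vec-style CTC checkpoint could be used for transcription via the Hugging Face transformers API. This is a minimal illustration, not the authors' released pipeline: the checkpoint path is a placeholder, and the input is a dummy waveform standing in for real 16 kHz speech.

```python
# Minimal sketch: greedy CTC decoding with a fine-tuned wav2vec-style ASR model.
# CHECKPOINT is a placeholder, not a released model ID.
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CHECKPOINT = "path/to/finetuned-wav2vec-asr-checkpoint"  # placeholder

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT)
model.eval()

# Stand-in for a real utterance: one second of 16 kHz mono audio.
speech = np.zeros(16_000, dtype=np.float32)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, time, vocab)

# Greedy decoding: pick the most likely token per frame, then collapse with CTC rules.
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(pred_ids)
print(transcription)
```

In practice, fine-tuning replaces the pretrained model's quantization head with a randomly initialized CTC head over the target language's character vocabulary, then trains on paired speech and transcripts; greedy decoding as above can be swapped for beam search with a language model for better accuracy.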