Voice conversion (VC) could improve speech recognition systems in low-resource languages by augmenting their limited training data. However, VC has not been widely used for this purpose because of practical issues such as compute speed and limitations when converting to and from unseen speakers. Moreover, it is still unclear whether a VC model trained on one well-resourced language can be applied to speech from another, low-resource language for data augmentation. In this work we assess whether a VC system can be used cross-lingually to improve low-resource speech recognition. We combine several recent techniques to design and train a practical VC system in English, and then use this system to augment data for training speech recognition models in several low-resource languages. When using a sensible amount of VC-augmented data, speech recognition performance improves in all four low-resource languages considered. We also show that VC-based augmentation is superior to SpecAugment (a widely used signal processing augmentation method) in the low-resource languages considered.