Self- and semi-supervised learning methods have been actively investigated to reduce the need for labeled training data or to enhance model performance. However, these approaches mostly focus on in-domain performance on public datasets. In this study, we combine self- and semi-supervised learning methods to address the unseen-domain adaptation problem in a large-scale production setting for an online ASR model. This approach demonstrates that using the source-domain data together with only a small fraction (3%) of the target-domain data can recover the performance gap relative to a full-data baseline, yielding a 13.5% relative WER improvement on the target-domain data.