Despite recent advances in deep learning, child speech recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated training data, which are scarce for child speech. In this work, we explore the wav2vec2 ASR model, which is pretrained with self-supervised learning (SSL), under different pretraining and finetuning configurations to improve automatic child speech recognition. The pretrained wav2vec2 models were finetuned on varying amounts of child speech data, adult speech data, and combinations of the two, to determine the optimal amount of data needed to finetune the model for child ASR. Our trained model achieves a best Word Error Rate (WER) of 7.42 on the MyST child speech dataset, 2.99 on the PFSTAR dataset, and 12.47 on the CMU KIDS dataset, outperforming all previous methods. Using just 10 hours of child speech data for finetuning, our models outperform wav2vec2 BASE 960, considered a state-of-the-art ASR model for adult speech, on child speech. We also analyze how different types of training data affect inference by combining datasets across pretraining, finetuning, and inference.
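To make the finetuning recipe concrete, the sketch below performs one CTC finetuning step on a pretrained wav2vec2 model using the HuggingFace Transformers library. This is a minimal illustration, not the paper's exact setup: the checkpoint name, placeholder utterance, transcript, and learning rate are all assumptions.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Assumed checkpoint; the paper's pretrained models may differ.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # keep the convolutional feature extractor frozen

# Placeholder standing in for one child-speech training example:
# a 1-second silent 16 kHz waveform and an invented transcript.
waveform = torch.zeros(16000)
transcript = "HELLO"

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(transcript, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed hyperparameter
model.train()
outputs = model(input_values=inputs.input_values, labels=labels)  # computes CTC loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this step would be repeated over a labeled child-speech corpus (e.g., a subset of MyST), with the amount of finetuning data varied as described above.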