The task of phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The first, Wav2Vec2-FS, is a semi-supervised model that learns phone-to-audio alignment directly through contrastive learning and a forward sum loss, and can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The second, Wav2Vec2-FC, is a frame classification model trained on forced-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are unavailable, produce results closely matching those of existing forced-alignment tools. Our work presents a fully automated neural pipeline for phone-to-audio alignment. Code and pretrained models are available at https://github.com/lingjzhu/charsiu.