Phone-to-audio alignment without text: A Semi-supervised Approach

Phone-to-audio alignment has many applications in speech research. Here we introduce two Wav2Vec2-based models for both text-dependent and text-independent phone-to-audio alignment. The first, Wav2Vec2-FS, is a semi-supervised model that learns phone-to-audio alignment directly through contrastive learning and a forward sum loss, and it can be coupled with a pretrained phone recognizer to achieve text-independent alignment. The second, Wav2Vec2-FC, is a frame classification model trained on force-aligned labels that can perform both forced alignment and text-independent segmentation. Evaluation results suggest that both proposed methods, even when transcriptions are not available, produce results that closely match those of existing forced alignment tools. Our work presents a neural pipeline for fully automated phone-to-audio alignment. Code and pretrained models are available at https://github.com/lingjzhu/charsiu.
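To make the frame classification idea concrete, here is a minimal sketch of how per-frame phone predictions could be collapsed into timed segments for text-independent segmentation. It is an illustration only, not the charsiu implementation: the function name `frames_to_segments`, the example labels, and the 20 ms frame duration are all assumptions, and the real frame rate depends on the model configuration.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_duration=0.02):
    """Collapse per-frame phone labels into (phone, start, end) segments.

    frame_labels: one predicted phone label per frame (hypothetical input).
    frame_duration: seconds per frame; 0.02 is an assumed value.
    """
    segments = []
    index = 0  # index of the first frame of the current run
    for phone, run in groupby(frame_labels):
        length = len(list(run))
        start = index * frame_duration
        end = (index + length) * frame_duration
        # Round to suppress floating-point noise in the timestamps.
        segments.append((phone, round(start, 4), round(end, 4)))
        index += length
    return segments


# Hypothetical frame-level output of a frame classifier.
frames = ["sil", "sil", "k", "k", "k", "ae", "ae", "t", "sil"]
segments = frames_to_segments(frames)
for phone, start, end in segments:
    print(f"{phone}\t{start:.2f}\t{end:.2f}")
```

Each run of identical consecutive labels becomes one segment, so segment boundaries fall exactly where the predicted label changes. A real system would typically also smooth or constrain the frame predictions before this step.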

