Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5x and training time by 2.3x, showcasing its scalability and efficiency.
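The masked-prediction objective mentioned above can be illustrated with a minimal sketch: random positions in a sequence of discrete codec unit IDs are replaced with a mask sentinel, and the original IDs at those positions become the training targets. All names here (`mask_tokens`, `MASK_ID`, the masking probability) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of masked prediction over discrete codec units.
# MASK_ID, MASK_PROB, and mask_tokens are hypothetical names for
# illustration only; the paper's exact setup may differ.
import random

MASK_ID = -1      # sentinel ID marking a masked position
MASK_PROB = 0.15  # assumed fraction of units to mask

def mask_tokens(units, mask_prob=MASK_PROB, seed=0):
    """Replace a random subset of codec unit IDs with MASK_ID.

    Returns (masked, targets): targets holds the original ID at
    each masked position and None elsewhere, so a model can be
    trained to predict only the masked units.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for u in units:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            targets.append(u)     # training target: original unit
        else:
            masked.append(u)
            targets.append(None)  # no loss at unmasked positions
    return masked, targets

units = [17, 203, 5, 98, 512, 44, 7, 300]  # toy codec unit IDs
masked, targets = mask_tokens(units)
```

In a full framework, a Transformer encoder would consume `masked` and a cross-entropy loss would be computed only at positions where `targets` is not None; the various target-derivation strategies the abstract mentions would change what goes into `targets` (e.g. raw units vs. units from other codebook layers).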