We propose spoken sentence embeddings that capture both acoustic and linguistic content. While existing methods operate at the character, phoneme, or word level, our method learns long-term dependencies by modeling speech at the sentence level. We formulate this as an audio-linguistic multitask learning problem, in which an encoder-decoder model simultaneously reconstructs acoustic and natural language features from audio. Our results show that spoken sentence embeddings outperform phoneme- and word-level baselines on speech recognition and emotion recognition tasks. Ablation studies show that our embeddings can better model high-level acoustic concepts while retaining linguistic content. Overall, our work illustrates the viability of generic, multi-modal sentence embeddings for spoken language understanding.
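As a concrete illustration of the architecture described above, the sketch below shows one way such a multitask encoder-decoder could look in PyTorch: a single recurrent encoder compresses a spoken sentence into a fixed-size embedding, and two heads decode it back into acoustic and linguistic targets whose losses are summed. Every specific here (the GRU encoder, mean-frame acoustic reconstruction, a bag-of-words linguistic target, and all layer sizes) is an illustrative assumption, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskSentenceEncoder(nn.Module):
    """Minimal sketch of an audio-linguistic multitask encoder-decoder.

    A recurrent encoder maps a sequence of acoustic frames to one
    sentence-level embedding; two heads reconstruct acoustic features
    and predict linguistic (word) targets from that shared embedding.
    """

    def __init__(self, n_acoustic=40, d_embed=256, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(n_acoustic, d_embed, batch_first=True)
        # Acoustic head: reconstruct a summary of the input frames.
        self.acoustic_head = nn.Linear(d_embed, n_acoustic)
        # Linguistic head: predict a word distribution over the vocabulary.
        self.linguistic_head = nn.Linear(d_embed, vocab_size)

    def forward(self, frames):
        # frames: (batch, time, n_acoustic), e.g. log-mel filterbanks
        _, h = self.encoder(frames)            # h: (1, batch, d_embed)
        embedding = h.squeeze(0)               # the sentence embedding
        acoustic_recon = self.acoustic_head(embedding)  # (batch, n_acoustic)
        word_logits = self.linguistic_head(embedding)   # (batch, vocab_size)
        return embedding, acoustic_recon, word_logits

# Toy training step: joint loss over both tasks forces the embedding
# to retain acoustic and linguistic content simultaneously.
model = MultitaskSentenceEncoder()
frames = torch.randn(8, 120, 40)               # 8 utterances, 120 frames each
words = torch.randint(0, 10000, (8,))          # one toy word target per utterance
emb, recon, logits = model(frames)
loss = (F.mse_loss(recon, frames.mean(dim=1))  # acoustic reconstruction
        + F.cross_entropy(logits, words))      # linguistic prediction
loss.backward()
```

In this sketch the acoustic head reconstructs only the time-averaged frame for brevity; a per-frame decoder (e.g. another GRU unrolled over time) would be the natural extension, and the same shared-embedding, summed-loss structure carries over unchanged.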