Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their cross-modal counterparts are less understood. In this work, we study intrinsic properties of a joint speech-text embedding space, constructed by minimizing the distance between paired utterance and transcription inputs in a teacher-student model setup, that are informative for several prominent use cases. We find that incorporating automatic speech recognition, through both pretraining and multitask scenarios, significantly aids semantic alignment, resulting in more tightly coupled embeddings. To analyse the cross-modal embeddings, we employ a quantitative retrieval accuracy metric for semantic alignment, zero-shot classification for generalisability, and probing of the encoders to observe the extent of knowledge transfer from one modality to another.
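The training objective and the retrieval-based evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mean-squared-error distance, the function names, and the use of cosine similarity for retrieval are assumptions made for the sake of a concrete example.

```python
import numpy as np

def pair_distance_loss(speech_emb, text_emb):
    """Teacher-student objective (illustrative): mean squared distance
    between the embeddings of a paired utterance and its transcription.
    In a teacher-student setup, gradients would flow only through the
    student (speech) encoder."""
    return float(np.mean((speech_emb - text_emb) ** 2))

def retrieval_accuracy(speech_embs, text_embs):
    """Cross-modal retrieval accuracy: the fraction of utterances whose
    nearest text embedding (by cosine similarity) is the paired
    transcription. Rows of the two arrays are assumed to be paired."""
    s = speech_embs / np.linalg.norm(speech_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = s @ t.T  # pairwise cosine similarities
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(s))))
```

A perfectly aligned space gives a retrieval accuracy of 1.0 and a pair distance of 0.0; less tightly coupled embeddings degrade both.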