Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and is implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap more than prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art results on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains of up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and that diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance at https://github.com/DevKiHyun/Diffusion-Link

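To make the "lightweight network with three residual MLP blocks" concrete, the following is a minimal NumPy sketch of a denoiser of that shape, paired with a standard DDPM-style forward-noising step. It is not the authors' implementation: the embedding size, hidden width, ReLU activation, sinusoidal timestep features, linear noise schedule, and all function names are assumptions made for illustration only, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_block(dim, hidden):
    # Parameters of one residual MLP block (sizes are illustrative assumptions).
    return {
        "w1": rng.normal(0.0, 0.02, (dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, (hidden, dim)),
        "b2": np.zeros(dim),
    }

def residual_block(x, p):
    # x + MLP(x); ReLU is an assumed activation, not taken from the paper.
    h = np.maximum(x @ p["w1"] + p["b1"], 0.0)
    return x + h @ p["w2"] + p["b2"]

def timestep_features(t, dim):
    # Sinusoidal timestep features (assumed conditioning scheme; dim must be even).
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def denoiser(x_t, t, blocks):
    # Add timestep features to the noisy embedding, then apply the residual blocks.
    h = x_t + timestep_features(t, x_t.shape[1])
    for p in blocks:
        h = residual_block(h, p)
    return h

def forward_noise(x0, t, alpha_bar):
    # Standard DDPM forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps.
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t][:, None]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

# Toy usage with hypothetical sizes: 4 embeddings of dimension 8, 100 diffusion steps.
dim, hidden, num_steps = 8, 16, 100
betas = np.linspace(1e-4, 0.02, num_steps)   # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)
blocks = [init_block(dim, hidden) for _ in range(3)]  # three residual MLP blocks

x0 = rng.standard_normal((4, dim))           # stand-in for frozen-encoder embeddings
t = rng.integers(0, num_steps, size=4)
x_t, eps = forward_noise(x0, t, alpha_bar)
pred = denoiser(x_t, t, blocks)              # untrained prediction, same shape as x0
```

In a trained system of this kind, the denoiser would be fit so that repeated denoising moves an audio embedding toward the text-embedding distribution; the training objective and sampling procedure are not specified in the abstract and are deliberately left out of this sketch.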