Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, their internal dynamics under fine-tuning remain poorly understood. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and the associated massive activations, in which a few feature dimensions of sink tokens take on extremely large values. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and occur at fixed feature indices shared across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, which amplifies their attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces the cosine similarity between the BOS token and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
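To make the two ingredients described above concrete, the following is a minimal PyTorch sketch: a diagnostic that flags sink tokens and massive-activation feature indices, and the decorrelation loss itself. All names (`find_sinks`, `decorrelation_loss`, `lambda_decor`) and the threshold values are illustrative assumptions for exposition, not the paper's actual implementation or criteria.

```python
import torch
import torch.nn.functional as F

def find_sinks(attn: torch.Tensor, hidden: torch.Tensor,
               attn_thresh: float = 0.3, act_thresh: float = 50.0):
    """Flag attention sinks and massive-activation feature indices.

    attn:   (heads, seq_len, seq_len) attention weights from one layer.
    hidden: (seq_len, dim) hidden states from the same layer.
    Thresholds are illustrative; the paper's criteria may differ.
    """
    # A token is a sink if the average attention it receives (over heads
    # and query positions) is disproportionately high.
    received = attn.mean(dim=0).mean(dim=0)            # (seq_len,)
    sink_tokens = (received > attn_thresh).nonzero().flatten()

    # Massive activations: feature indices whose magnitude at sink tokens
    # far exceeds the typical scale; the same indices recur across sinks.
    sink_feats = hidden[sink_tokens].abs()             # (n_sinks, dim)
    massive_idx = (sink_feats > act_thresh).any(dim=0).nonzero().flatten()
    return sink_tokens, massive_idx

def decorrelation_loss(hidden: torch.Tensor) -> torch.Tensor:
    """Mean absolute cosine similarity between BOS (position 0) and the rest.

    hidden: (batch, seq_len, dim) hidden states. Lowering this similarity
    is the mechanism for suppressing intermediate sinks, since intermediate
    sink tokens are those that align closely with the BOS direction.
    """
    bos, rest = hidden[:, :1, :], hidden[:, 1:, :]
    return F.cosine_similarity(bos, rest, dim=-1).abs().mean()

# Hypothetical training objective: the usual ASR/VSR/AVSR cross-entropy
# plus the decorrelation penalty with a small weight.
def total_loss(ce_loss: torch.Tensor, hidden: torch.Tensor,
               lambda_decor: float = 0.1) -> torch.Tensor:
    return ce_loss + lambda_decor * decorrelation_loss(hidden)
```

In practice the penalty would be applied at the layer(s) where sinks emerge; which layers and which weight work best is an empirical question the abstract does not specify.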