Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa.
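The core idea, attending over image tokens and the local text tokens within a single attention layer, can be illustrated with a minimal numpy sketch. This is not the paper's implementation: projections, multi-head structure, and causal masking are omitted, and all shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def casa_layer(text, image):
    """Illustrative cross-attention via self-attention: text queries attend
    over the concatenation of image tokens AND the text tokens themselves,
    so one layer mixes text-to-image and local text-to-text interaction.
    (Sketch only; real layers add projections, heads, and masking.)"""
    d = text.shape[-1]
    kv = np.concatenate([image, text], axis=0)   # joint keys/values
    scores = text @ kv.T / np.sqrt(d)            # (n_text, n_image + n_text)
    return softmax(scores) @ kv                  # (n_text, d)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text tokens, dim 8
image = rng.normal(size=(6, 8))   # 6 image tokens, dim 8
out = casa_layer(text, image)
print(out.shape)  # (4, 8)
```

Because the image tokens enter only as keys/values of this side layer, the language model's main sequence length stays independent of the number of image tokens, which is what gives cross-attention-style models their scalability on long-context inputs.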