Vision-Language Models (VLMs) are a family of models that align image content with natural language. Existing approaches typically fuse either (a) early, by mixing tokens/features inside the encoders, or (b) late, by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose BRIDGE, a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
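To make the abstract's description concrete, the following is a minimal sketch of one cross-only, bidirectional fusion layer: both hidden-state sequences are projected into a shared space, each modality attends only to the other, and gated residual updates flow back into the encoders. The class and parameter names (`CrossOnlyFusionLayer`, `d_shared`), the tanh-gated residuals, and the choice of pre-attention LayerNorms as the "stabilizers" are illustrative assumptions, not the released BRIDGE implementation.

```python
# Minimal sketch of a cross-only, bidirectional fusion layer (assumed design,
# not the official BRIDGE code).
import torch
import torch.nn as nn

class CrossOnlyFusionLayer(nn.Module):
    def __init__(self, d_vision: int, d_text: int, d_shared: int, n_heads: int = 8):
        super().__init__()
        # Project each modality's hidden states into a shared space.
        self.vis_proj = nn.Linear(d_vision, d_shared)
        self.txt_proj = nn.Linear(d_text, d_shared)
        # Cross-only attention: each modality attends to the other, never to itself.
        self.vis_attends_txt = nn.MultiheadAttention(d_shared, n_heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(d_shared, n_heads, batch_first=True)
        # Map the fused updates back to each encoder's native width.
        self.vis_out = nn.Linear(d_shared, d_vision)
        self.txt_out = nn.Linear(d_shared, d_text)
        # Learned gates initialized at zero so fusion perturbs the encoders gently.
        self.vis_gate = nn.Parameter(torch.zeros(1))
        self.txt_gate = nn.Parameter(torch.zeros(1))
        # Simple stabilizers (assumed here to be pre-attention LayerNorms).
        self.vis_norm = nn.LayerNorm(d_shared)
        self.txt_norm = nn.LayerNorm(d_shared)

    def forward(self, vis_states: torch.Tensor, txt_states: torch.Tensor):
        # vis_states: (B, N_patches, d_vision); txt_states: (B, N_tokens, d_text)
        v = self.vis_norm(self.vis_proj(vis_states))
        t = self.txt_norm(self.txt_proj(txt_states))
        # Bidirectional, non-causal cross-attention between the two sequences.
        v_upd, _ = self.vis_attends_txt(query=v, key=t, value=t)
        t_upd, _ = self.txt_attends_vis(query=t, key=v, value=v)
        # Gated residual updates flow back into each encoder's hidden states.
        vis_states = vis_states + torch.tanh(self.vis_gate) * self.vis_out(v_upd)
        txt_states = txt_states + torch.tanh(self.txt_gate) * self.txt_out(t_upd)
        return vis_states, txt_states
```

Because the layer only adds gated residuals on top of otherwise unchanged encoders, the bi-encoder remains usable without fusion at inference time, which is consistent with the efficiency claim in the abstract.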