Vision-Language Models (VLMs) are a family of models that align image content with natural language. Existing approaches typically fuse either (a) early, by mixing tokens/features inside the encoders, or (b) late, by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose BRIDGE, a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
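To make the abstract's description concrete, the following is a minimal sketch of one cross-only, bidirectional fusion layer: both hidden-state sequences are projected into a shared space, each modality attends only to the other, and gated residual updates flow back into the encoders. The class and parameter names (`CrossOnlyFusionLayer`, `d_shared`), the tanh-gated residuals, and the choice of pre-attention LayerNorms as the "stabilizers" are illustrative assumptions, not the released BRIDGE implementation.

```python
# Minimal sketch of a cross-only, bidirectional fusion layer (assumed design,
# not the official BRIDGE code).
import torch
import torch.nn as nn

class CrossOnlyFusionLayer(nn.Module):
    def __init__(self, d_vision: int, d_text: int, d_shared: int, n_heads: int = 8):
        super().__init__()
        # Project each modality's hidden states into a shared space.
        self.vis_proj = nn.Linear(d_vision, d_shared)
        self.txt_proj = nn.Linear(d_text, d_shared)
        # Cross-only attention: each modality attends to the other, never to itself.
        self.vis_attends_txt = nn.MultiheadAttention(d_shared, n_heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(d_shared, n_heads, batch_first=True)
        # Map the fused updates back to each encoder's native width.
        self.vis_out = nn.Linear(d_shared, d_vision)
        self.txt_out = nn.Linear(d_shared, d_text)
        # Learned gates initialized at zero so fusion perturbs the encoders gently.
        self.vis_gate = nn.Parameter(torch.zeros(1))
        self.txt_gate = nn.Parameter(torch.zeros(1))
        # Simple stabilizers (assumed here to be pre-attention LayerNorms).
        self.vis_norm = nn.LayerNorm(d_shared)
        self.txt_norm = nn.LayerNorm(d_shared)

    def forward(self, vis_states: torch.Tensor, txt_states: torch.Tensor):
        # vis_states: (B, N_patches, d_vision); txt_states: (B, N_tokens, d_text)
        v = self.vis_norm(self.vis_proj(vis_states))
        t = self.txt_norm(self.txt_proj(txt_states))
        # Bidirectional, non-causal cross-attention between the two sequences.
        v_upd, _ = self.vis_attends_txt(query=v, key=t, value=t)
        t_upd, _ = self.txt_attends_vis(query=t, key=v, value=v)
        # Gated residual updates flow back into each encoder's hidden states.
        vis_states = vis_states + torch.tanh(self.vis_gate) * self.vis_out(v_upd)
        txt_states = txt_states + torch.tanh(self.txt_gate) * self.txt_out(t_upd)
        return vis_states, txt_states
```

Because the layer only adds gated residuals on top of otherwise unchanged encoders, the bi-encoder remains usable without fusion at inference time, which is consistent with the efficiency claim in the abstract.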