While vision-language models (VLMs) have demonstrated remarkable performance across multimodal tasks, their choice of vision encoder presents a fundamental weakness: the encoders' low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored to these tasks. Our training pipeline is designed to leverage abundant unlabeled data, bypassing the need for costly large-scale annotation of document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage in which the model learns tasks such as parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) we introduce a novel model-merging scheme that combines encoders trained with different text decoders to ensure broad compatibility with different web-agent architectures, and (ii) we use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document- and web-specific representations. Extensive experiments on classic document tasks, VQA, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
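The following is a minimal illustrative sketch, not the paper's released implementation, of the two alignment strategies named above: it assumes model merging is realized as weight-space averaging of encoders fine-tuned with different text decoders, and that the ensemble fuses frozen generalist features (e.g., from SigLIP2) with DAVE's document/web-specific features by per-token concatenation and a linear projection. All module and function names here are hypothetical.

```python
# Sketch of (i) model merging and (ii) feature ensembling, under the
# assumptions stated in the lead-in; not the authors' actual code.
import torch
import torch.nn as nn


def merge_encoders(state_dicts, weights=None):
    """Average parameters of encoders trained with different text decoders."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged


class EnsembleFusion(nn.Module):
    """Fuse generalist (e.g., SigLIP2) and document/web-specific patch features."""

    def __init__(self, generalist_dim, specialist_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(generalist_dim + specialist_dim, out_dim)

    def forward(self, generalist_feats, specialist_feats):
        # Both inputs: (batch, num_patches, dim); fused per patch token.
        return self.proj(torch.cat([generalist_feats, specialist_feats], dim=-1))
```

As a usage note under the same assumptions, the merged state dict would be loaded into a single encoder before the ensemble stage, and the fusion module would sit between the vision encoders and the VLM's language model.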