Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly with fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvements over both proprietary (+7.9 overall) and open-source (+9.8 overall) models.
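To make the three-level idea concrete, the following is a minimal illustrative sketch, not the SlideAgent implementation. It assumes a query-agnostic representation can be held as a global summary, per-page summaries, and per-element text, and it stands in a toy keyword router for the selective activation of level-specific agents. All class names, function names, and keyword lists (`Element`, `Page`, `DeckKnowledge`, `select_levels`, `gather_context`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    kind: str  # hypothetical label, e.g. "title", "chart", "icon"
    text: str  # textual description of the element


@dataclass
class Page:
    number: int
    summary: str
    elements: List[Element] = field(default_factory=list)


@dataclass
class DeckKnowledge:
    """Query-agnostic representation built once per document (assumed shape)."""
    global_summary: str
    pages: List[Page]


def select_levels(query: str) -> List[str]:
    """Toy stand-in for selective agent activation: choose levels by keyword."""
    q = query.lower()
    levels = []
    if any(w in q for w in ("overall", "theme", "purpose")):
        levels.append("global")
    if any(w in q for w in ("page", "slide")):
        levels.append("page")
    if any(w in q for w in ("chart", "icon", "color", "label")):
        levels.append("element")
    # With no cue in the query, fall back to consulting every level.
    return levels or ["global", "page", "element"]


def gather_context(deck: DeckKnowledge, query: str) -> List[str]:
    """Collect the stored knowledge from each selected level, in level order."""
    context = []
    for level in select_levels(query):
        if level == "global":
            context.append(deck.global_summary)
        elif level == "page":
            context.extend(p.summary for p in deck.pages)
        else:  # "element"
            context.extend(e.text for p in deck.pages for e in p.elements)
    return context


deck = DeckKnowledge(
    global_summary="Quarterly sales review",
    pages=[Page(1, "Revenue by region", [Element("chart", "Bar chart of revenue")])],
)
context = gather_context(deck, "What does the chart on slide 1 show?")
```

In a real system each level would presumably be handled by a model-backed agent rather than a keyword match, and the gathered context would then be passed to an answer-synthesis step. The sketch only shows how a representation built once per document can be reused across queries.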