SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities for document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvements over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
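To make the three-level decomposition concrete, the following is a minimal illustrative sketch in Python, not the authors' implementation. It assumes a query-agnostic index with a global summary, per-page summaries, and per-element entries, and uses a toy keyword heuristic as a stand-in for selective agent activation. All names here (`DeckIndex`, `Page`, `Element`, `select_levels`, `gather_context`) are hypothetical, and the sample deck contents are invented placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str  # hypothetical element type, e.g. "title", "chart", "text"
    text: str  # textual description of the element


@dataclass
class Page:
    number: int
    summary: str
    elements: list[Element] = field(default_factory=list)


@dataclass
class DeckIndex:
    """Query-agnostic representation built once per document (assumed shape)."""
    global_summary: str
    pages: list[Page]


def select_levels(query: str) -> list[str]:
    """Toy router: a keyword heuristic standing in for agent selection."""
    q = query.lower()
    levels = []
    if any(w in q for w in ("overall", "theme", "main")):
        levels.append("global")
    if any(w in q for w in ("page", "slide")):
        levels.append("page")
    if any(w in q for w in ("chart", "icon", "figure", "label")):
        levels.append("element")
    # Fall back to all three levels when the query gives no hint.
    return levels or ["global", "page", "element"]


def gather_context(index: DeckIndex, query: str) -> dict[str, list[str]]:
    """Collect context only from the levels selected for this query."""
    context: dict[str, list[str]] = {}
    for level in select_levels(query):
        if level == "global":
            context[level] = [index.global_summary]
        elif level == "page":
            context[level] = [f"p{p.number}: {p.summary}" for p in index.pages]
        else:
            context[level] = [
                f"p{p.number} {e.kind}: {e.text}"
                for p in index.pages
                for e in p.elements
            ]
    return context


# Placeholder deck used only to exercise the sketch.
index = DeckIndex(
    global_summary="Quarterly product update",
    pages=[
        Page(1, "Title slide", [Element("title", "Q3 Update")]),
        Page(2, "Revenue overview", [Element("chart", "Revenue by region")]),
    ],
)

context = gather_context(index, "What does the chart on slide 2 show?")
```

In this sketch the gathered context would then be passed to a model to produce the final answer; how SlideAgent actually routes queries and merges agent outputs is not specified in the abstract, so that step is left out.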