Long-form video understanding is challenging because of the extensive temporal-spatial complexity of such videos and the difficulty of answering questions over such extended contexts. Although Large Language Models (LLMs) have advanced considerably in video analysis and long-context handling, they remain limited when processing information-dense, hour-long videos. To overcome these limitations, we propose the Deep Video Discovery (DVD) agent, which applies an agentic search strategy over segmented video clips. Unlike previous video agents that apply a predefined workflow uniformly across queries, our approach emphasizes the autonomous and adaptive nature of agents. Given a set of search-centric tools over a multi-granular video database, the DVD agent uses the advanced reasoning capability of the LLM to plan from its current observation state and strategically selects tools, orchestrating an adaptive workflow for each query in light of the information gathered so far. Comprehensive evaluation on multiple long-video understanding benchmarks demonstrates the advantage of our approach. The DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior work, and improves further to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
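To make the idea of an agent that plans from its current observations and picks tools per query more concrete, the following is a minimal sketch and not the released DVD implementation. Everything in it is an assumption made for illustration: the `Clip` record, the tool names `search_clips` and `inspect_clip`, and the rule-based `toy_planner` that stands in for the LLM are all hypothetical, and captions stand in for real video content. The actual tools and prompts are in the repository linked above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Clip:
    start_s: int   # clip start time in seconds
    end_s: int     # clip end time in seconds
    caption: str   # text description standing in for real video content


def search_clips(db: List[Clip], query: str) -> str:
    """Hypothetical coarse-grained tool: rank clips by word overlap with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(c.caption.lower().split())), i) for i, c in enumerate(db)]
    best_score, best_idx = max(scored)
    return f"clip {best_idx}" if best_score > 0 else "no match"


def inspect_clip(db: List[Clip], arg: str) -> str:
    """Hypothetical fine-grained tool: return the time span and caption of one clip."""
    clip = db[int(arg)]
    return f"[{clip.start_s}-{clip.end_s}s] {clip.caption}"


def toy_planner(question: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: choose the next action from the observations so far."""
    if not observations:
        return "search_clips", question          # nothing known yet: search broadly
    last = observations[-1]
    if last.startswith("clip "):
        return "inspect_clip", last.split()[1]   # a candidate was found: look closer
    return "answer", last                        # otherwise report what was found


def run_agent(
    db: List[Clip],
    question: str,
    planner: Callable[[str, List[str]], Tuple[str, str]] = toy_planner,
    max_steps: int = 5,
) -> str:
    """Observe, plan, act loop: the planner decides which tool to call at each step."""
    tools = {"search_clips": search_clips, "inspect_clip": inspect_clip}
    observations: List[str] = []
    for _ in range(max_steps):
        action, arg = planner(question, observations)
        if action == "answer":
            return arg
        observations.append(tools[action](db, arg))
    return observations[-1] if observations else "no answer"


if __name__ == "__main__":
    db = [
        Clip(0, 60, "a man opens the door"),
        Clip(60, 120, "a dog chases a red ball in the park"),
    ]
    print(run_agent(db, "where does the dog chase the ball"))
```

The point of the sketch is the control flow: the sequence of tool calls is not fixed in advance but depends on what each call returns, so different questions can follow different paths through the same tool set. Replacing `toy_planner` with a call to an LLM would leave `run_agent` unchanged.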