The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle to simultaneously model spatial relationships within video frames and understand the causal dynamics of temporal evolution on complex, reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit that enhances their spatiotemporal reasoning capabilities while balancing the quantity and diversity of tools. To better control the tool-invocation sequence and avoid toolchain-shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key region in the video. Our STAR framework enhances GPT-4o with lightweight tools, achieving an 8.2% gain on VideoMME and a 4.6% gain on LongVideoBench. We believe the proposed Video Toolkit and STAR framework take an important step toward building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.
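To make the scheduling idea concrete, below is a minimal sketch of a temporal-then-spatial tool scheduler in the spirit described above: temporal tools first narrow down *when* the relevant content occurs, and spatial tools then localize *where* in those frames. All names here (`Tool`, `star_schedule`, `key_frame_retrieval`, `region_grounding`) are hypothetical illustrations, not the actual VideoTool API.

```python
# A minimal sketch of temporal-then-spatial tool scheduling. This is an
# assumption-based illustration of the idea in the abstract, not the
# authors' implementation; all names and signatures are hypothetical.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Tool:
    name: str
    kind: str                      # "temporal" or "spatial"
    run: Callable[[dict], dict]    # consumes and returns a shared evidence state

def star_schedule(tools: List[Tool], state: dict) -> dict:
    """Invoke all temporal tools before any spatial tool, so the model
    first localizes the key moments, then grounds regions within them.
    Enforcing this order avoids the 'toolchain shortcut' of running
    spatial grounding on unfiltered frames."""
    for phase in ("temporal", "spatial"):
        for tool in tools:
            if tool.kind == phase:
                state = tool.run(state)   # each tool refines the evidence
    return state

# Stand-in tools: a real system would call retrieval / grounding models.
def key_frame_retrieval(state: dict) -> dict:
    state["frames"] = ["t=12s", "t=13s"]                 # candidate key moments
    return state

def region_grounding(state: dict) -> dict:
    state["regions"] = [("t=12s", (40, 60, 200, 180))]   # (x1, y1, x2, y2) box
    return state

if __name__ == "__main__":
    tools = [
        Tool("region_grounding", "spatial", region_grounding),
        Tool("key_frame_retrieval", "temporal", key_frame_retrieval),
    ]
    evidence = star_schedule(tools, {"question": "What does the chef pick up?"})
    print(evidence)  # temporal evidence is gathered before spatial, regardless of list order
```

Note that the scheduler enforces phase order independently of the order in which tools are registered, which is one simple way to keep an LLM planner from skipping the temporal-localization step.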