Long-video understanding~(LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or rely on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of ``thinking with video'', which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until it reaches the final answer, enabling faithful, efficient, and interpretable reasoning. To address the scarcity of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling, which ensures high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is publicly available at https://github.com/yhy-2000/VideoDeepResearch.
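To make the exploration loop concrete, the following is a minimal sketch of the iterative ``thinking with video'' process described above. Every name here (`planner`, `grounder`, `perceiver`, their methods, and `ExplorationState`) is a hypothetical stand-in introduced for illustration; none of it is the repository's actual API.

```python
# A minimal sketch of the plan -> ground -> perceive loop described above.
# All component interfaces are assumed for illustration.

from dataclasses import dataclass, field

@dataclass
class ExplorationState:
    question: str
    evidence: list = field(default_factory=list)  # accumulated observations
    answer: str | None = None

def explore(video, question, planner, grounder, perceiver, max_steps=8):
    """Iteratively plan, ground, and perceive until an answer is reached."""
    state = ExplorationState(question=question)
    for _ in range(max_steps):
        # 1. Planning: decompose the task into the next sub-question,
        #    or decide the accumulated evidence suffices to answer.
        sub_question, done = planner.next_step(state)
        if done:
            state.answer = planner.answer(state)
            break
        # 2. Temporal grounding: locate moments relevant to the sub-question.
        segments = grounder.locate(video, sub_question)
        # 3. Scalable perception: inspect the located segments at a
        #    task-appropriate temporal resolution (coarse first, finer on demand).
        for seg in segments:
            state.evidence.append(perceiver.describe(video, seg, sub_question))
    return state.answer
```

A caller would supply the three components, e.g. an LLM-backed planner, a temporal-grounding model, and a clip-level captioner; the loop terminates either when the planner judges the accumulated evidence sufficient or when the step budget is exhausted.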
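The abstract also mentions difficulty-adaptive sampling for constructing training trajectories. Below is a hedged sketch of one plausible instantiation, assuming difficulty is estimated from the rollout pass rate; the heuristic, the sampling budget, and all names (`rollout`, `judge`) are illustrative assumptions rather than the paper's exact procedure.

```python
# A hedged sketch of difficulty-adaptive trajectory sampling: allocate more
# of the kept trajectories to questions that rollouts fail more often, so
# complex tasks remain well represented in the dataset.

import random

def difficulty_adaptive_sample(questions, rollout, judge, n_rollouts=8):
    """Collect reasoning trajectories, weighting retention by difficulty."""
    dataset = []
    for q in questions:
        trajectories = [rollout(q) for _ in range(n_rollouts)]
        correct = [t for t in trajectories if judge(q, t)]
        if not correct:
            continue  # no usable trajectory; skip (or escalate the budget)
        # Estimated difficulty: fraction of rollouts that fail.
        difficulty = 1.0 - len(correct) / n_rollouts
        # Keep more correct trajectories for harder questions, at least one.
        k = max(1, round(difficulty * len(correct)))
        dataset.extend(random.sample(correct, k))
    return dataset
```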