Despite the advances in multimodal understanding brought by Video Large Language Models (Vid-LLMs), streaming video reasoning remains challenging due to its reliance on contextual information. Existing paradigms feed all available historical context into Vid-LLMs, incurring a significant computational burden for visual data processing. Moreover, irrelevant context distracts models from key details. This paper introduces Context-guided Streaming Video Reasoning (CogStream), a challenging task that simulates real-world streaming video scenarios and requires models to identify the historical context most relevant to deducing answers to questions about the current stream. To support CogStream, we present a densely annotated dataset featuring extensive, hierarchical question-answer pairs generated by a semi-automatic pipeline. We further propose CogReasoner as a baseline model; it tackles the task effectively through visual stream compression and historical dialogue retrieval. Extensive experiments demonstrate the effectiveness of this method.
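As an illustration only (the abstract does not specify CogReasoner's retrieval mechanism), the sketch below shows one common way historical dialogue retrieval can be realized: embed each past question-answer pair and the incoming question, score them by cosine similarity, and keep only the top-k pairs as context. The `embed` function here is a hypothetical stand-in for any off-the-shelf sentence encoder, not the model's actual component.

```python
# Minimal sketch of history retrieval for streaming video QA.
# NOTE: `embed` is a hypothetical placeholder, not CogReasoner's encoder.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a sentence encoder; in practice this would
    be a learned text-embedding model returning a fixed-size vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def retrieve_history(question: str, history: list[tuple[str, str]], k: int = 3):
    """Select the k past (question, answer) pairs most relevant to `question`
    by cosine similarity between unit-normalized embeddings."""
    q = embed(question)
    scores = [float(q @ embed(hq + " " + ha)) for hq, ha in history]
    top = np.argsort(scores)[::-1][:k]
    return [history[i] for i in top]

# Toy usage: only the most relevant past exchanges are passed on as context.
history = [
    ("What is the chef doing?", "Chopping onions."),
    ("What color is the car?", "Red."),
    ("What dish is being prepared?", "A vegetable stir-fry."),
]
print(retrieve_history("What ingredient did the chef cut earlier?", history, k=2))
```

With a real encoder, this kind of filtering keeps the prompt short for the Vid-LLM while preserving the exchanges most likely to support the current answer, which matches the motivation stated above for discarding irrelevant context.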