We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After end-to-end training on this dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, taking a step toward unified, intelligent understanding of continuous video streams.