LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, which bridges the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
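To make the autoregressive, clip-by-clip setting concrete, the following is a minimal sketch of how a long video could be rolled out one clip at a time, with each clip conditioned on the last frame and a short history of the previous clip. It is an illustration under stated assumptions, not the LongVie 2 implementation: `rollout`, `toy_clip_model`, `history_len`, and the list-of-floats `Frame` type are all hypothetical names introduced here, and the toy model simply blends toward its control frames.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image: a flat list of pixel values

# Hypothetical model signature: (first_frame, history, controls) -> generated clip
ClipModel = Callable[[Frame, Sequence[Frame], Sequence[Frame]], List[Frame]]


def toy_clip_model(
    first_frame: Frame, history: Sequence[Frame], controls: Sequence[Frame]
) -> List[Frame]:
    """Stand-in generator: emits one frame per control frame.

    The toy ignores `history`; a real model would condition on it to keep
    adjacent clips consistent.
    """
    clip: List[Frame] = []
    prev = first_frame
    for control in controls:
        # Move halfway from the previous frame toward the control frame.
        prev = [0.5 * p + 0.5 * c for p, c in zip(prev, control)]
        clip.append(prev)
    return clip


def rollout(
    model: ClipModel,
    first_frame: Frame,
    control_clips: Sequence[Sequence[Frame]],
    history_len: int = 2,
) -> List[Frame]:
    """Generate a long video clip by clip (assumed autoregressive scheme)."""
    video: List[Frame] = []
    history: List[Frame] = []
    start = first_frame
    for controls in control_clips:
        clip = model(start, history, controls)
        video.extend(clip)
        history = clip[-history_len:]  # context passed to the next clip
        start = clip[-1]               # last generated frame seeds the next clip
    return video


# Two clips of three frames each, driven by constant control frames.
controls = [[[1.0, 1.0]] * 3, [[1.0, 1.0]] * 3]
video = rollout(toy_clip_model, [0.0, 0.0], controls)
```

Because each clip starts from a generated frame rather than a clean one, errors can accumulate over a long rollout; this is the training-versus-inference gap that the degradation-aware stage described above is meant to address.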
