World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still treat world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, a powerful world model that produces high-fidelity future forecasts with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent representation of DriveLaW-Video. Both components are optimized with a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results on both tasks: DriveLaW not only advances video prediction significantly, surpassing the best-performing prior work by 33.3% in FID and 1.8% in FVD, but also sets a new record on the NAVSIM planning benchmark.