Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes

Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose \textbf{Mirage}, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at https://github.com/wm-research/mirage.

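The abstract's claim that temporally agnostic 2D latents can be injected into a causal decoder without breaking temporal causality can be illustrated with a toy sketch. This is not Mirage's implementation: `encode_frame_2d`, `causal_running_mean`, `decode_with_injection`, the additive injection, and all shapes are hypothetical stand-ins chosen only to make the causality argument checkable. The idea shown is that if every injected feature is computed from a single frame, then the output at time t still depends only on frames up to t.

```python
def encode_frame_2d(frame):
    """Hypothetical stand-in for a pretrained 2D encoder.

    Averages adjacent pixel pairs of ONE frame, so it never looks at other frames.
    """
    return [(frame[i] + frame[i + 1]) / 2 for i in range(0, len(frame) - 1, 2)]


def causal_running_mean(latents):
    """Toy causal temporal layer: output at time t is the mean of latents 0..t."""
    out = []
    total = [0.0] * len(latents[0])
    for t, z in enumerate(latents, start=1):
        total = [s + v for s, v in zip(total, z)]
        out.append([s / t for s in total])
    return out


def decode_with_injection(latents, video):
    """Toy decoder step: causal temporal mixing plus additive per-frame injection.

    The additive form is an assumption made for illustration only.
    """
    mixed = causal_running_mean(latents)
    skips = [encode_frame_2d(frame) for frame in video]
    return [[m + s for m, s in zip(m_t, s_t)] for m_t, s_t in zip(mixed, skips)]


def future_edit_leaves_past_unchanged(t_edit=3):
    """Perturb frames and latents from t_edit onward and compare decoder outputs."""
    video = [[float(t + i) for i in range(8)] for t in range(6)]
    latents = [[float(t * i) for i in range(4)] for t in range(6)]
    before = decode_with_injection(latents, video)

    video_b = [f if t < t_edit else [v + 1.0 for v in f]
               for t, f in enumerate(video)]
    latents_b = [z if t < t_edit else [v + 1.0 for v in z]
                 for t, z in enumerate(latents)]
    after = decode_with_injection(latents_b, video_b)

    past_same = before[:t_edit] == after[:t_edit]
    future_changed = before[t_edit:] != after[t_edit:]
    return past_same, future_changed
```

Calling `future_edit_leaves_past_unchanged()` perturbs only the later frames and checks that earlier outputs are untouched while later outputs change. A feature path that mixed information across time, such as one that pools over the whole clip, would fail the first check, which is the failure mode the abstract attributes to passing 3D encoder features directly to the decoder.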