We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enables the generation of long driving videos (up to $800\times424$ resolution at $12$ FPS) and corresponding dynamic 3D scenes, achieving state-of-the-art results while maintaining computational efficiency.