Video diffusion techniques have advanced significantly in recent years; however, they struggle to generate realistic imagery of car crashes due to the scarcity of accident events in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To address this problem, we propose Ctrl-Crash, a controllable car-crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation, where minor variations in the input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with an independently tunable scale for each conditioning signal. Ctrl-Crash achieves state-of-the-art performance on quantitative video-quality metrics (e.g., FVD and JEDi) and on qualitative human evaluations of physical realism and video quality, compared to prior diffusion-based methods.
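To make the per-signal guidance concrete, the sketch below shows one common way to compose classifier-free guidance over multiple conditioning signals, contrasting a single-condition denoiser pass against a fully unconditional pass for each signal and weighting each difference by its own scale. The denoiser interface `eps_fn` and the signal names (`bboxes`, `crash_type`) are illustrative assumptions for this sketch, not Ctrl-Crash's actual implementation, whose exact guidance formulation may differ.

```python
def multi_cond_cfg(eps_fn, x_t, t, conds, scales):
    """Classifier-free guidance with an independent scale per signal.

    Hypothetical interface: eps_fn(x_t, t, **conds) is a denoiser whose
    keyword arguments are conditioning signals; passing None for a
    signal stands in for its null (dropped) embedding.

    Computes: eps_hat = eps_u + sum_i s_i * (eps(c_i) - eps_u),
    where eps_u is the fully unconditional prediction.
    """
    null = {name: None for name in conds}     # drop every signal
    eps_uncond = eps_fn(x_t, t, **null)       # unconditional pass
    eps_hat = eps_uncond.clone()
    for name, scale in scales.items():
        single = dict(null)
        single[name] = conds[name]            # enable one signal at a time
        eps_cond = eps_fn(x_t, t, **single)
        eps_hat = eps_hat + scale * (eps_cond - eps_uncond)
    return eps_hat

# Example: tune bounding-box and crash-type guidance independently,
# e.g. to push harder on the crash type than on the agent layout.
# eps_hat = multi_cond_cfg(eps_fn, x_t, t,
#                          conds={"bboxes": bb, "crash_type": ct},
#                          scales={"bboxes": 2.0, "crash_type": 5.0})
```

One design consequence of this formulation is that each extra signal costs one additional denoiser pass per sampling step, which is the usual trade-off for independently tunable scales.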