Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D-aware mask generation module that reconstructs the scene geometry and propagates the user-specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building on this spatial foundation, we extend a diffusion-based video generation model to jointly synthesize the inserted object and the local changes it induces in its surroundings, such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination-aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of an object-removed video, an object-present video, and a VLM-generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real-world scenarios, significantly outperforming existing research and commercial models.
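To make the ROSE++ supervision format concrete, the following is a minimal, hypothetical sketch of one training triplet (object-removed clip, object-present clip, VLM-generated reference image). The class name, field names, and file layout are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RoseTriplet:
    """One ROSE++ supervision triplet, as described in the abstract.

    Field names are illustrative assumptions, not the dataset's actual schema.
    """
    object_removed_frames: List[str]   # paths to the clip with the object removed (model input)
    object_present_frames: List[str]   # paths to the original clip containing the object (training target)
    reference_image: str               # path to the VLM-generated reference image of the object


if __name__ == "__main__":
    # Hypothetical example of how a single training sample might be assembled.
    sample = RoseTriplet(
        object_removed_frames=[f"clip0007/removed/{i:04d}.png" for i in range(49)],
        object_present_frames=[f"clip0007/original/{i:04d}.png" for i in range(49)],
        reference_image="clip0007/reference.png",
    )
    print(len(sample.object_removed_frames), sample.reference_image)
```

Under this reading, the object-removed clip plus the reference image form the conditioning input, while the object-present clip serves as the ground-truth target that also supervises local illumination and shading changes.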