We present Stable Video Materials 3D (SViM3D), a framework that predicts multi-view consistent, physically based rendering (PBR) materials from a single image. Recently, video diffusion models have been used successfully to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or must be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to jointly output spatially varying PBR parameters and surface normals with each generated view under explicit camera control. This unique setup allows relighting and generating a 3D asset using our model as a neural prior. We introduce various mechanisms into this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games, and other visual media.
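To make the relighting use case concrete, the sketch below shades per-pixel PBR maps (albedo, roughness, metallic, and surface normals, i.e., the kind of spatially varying parameters the abstract describes) under a single directional light with a standard Cook-Torrance GGX BRDF. This is a minimal illustration under assumed conventions, not the paper's renderer: the function name, map layout, and BRDF details are assumptions, and the actual shading model used by SViM3D may differ.

```python
import numpy as np

def relight(albedo, roughness, metallic, normal, light_dir, view_dir,
            light_color=(1.0, 1.0, 1.0)):
    """Shade per-pixel PBR maps under one directional light (hypothetical helper).

    albedo: (H, W, 3), roughness/metallic: (H, W), normal: (H, W, 3).
    """
    eps = 1e-6
    # Normalize the normal map and the light/view directions.
    n = normal / (np.linalg.norm(normal, axis=-1, keepdims=True) + eps)
    l = np.asarray(light_dir, float); l /= np.linalg.norm(l)
    v = np.asarray(view_dir, float);  v /= np.linalg.norm(v)
    h = l + v; h /= np.linalg.norm(h)  # half vector

    n_dot_l = np.clip(n @ l, 0.0, 1.0)[..., None]
    n_dot_v = np.clip(n @ v, 0.0, 1.0)[..., None]
    n_dot_h = np.clip(n @ h, 0.0, 1.0)[..., None]
    h_dot_v = np.clip(np.dot(h, v), 0.0, 1.0)

    r = roughness[..., None]
    m = metallic[..., None]

    # GGX normal distribution term.
    a2 = (r * r) ** 2
    d = a2 / (np.pi * (n_dot_h ** 2 * (a2 - 1.0) + 1.0) ** 2 + eps)

    # Smith-Schlick geometry term.
    k = (r + 1.0) ** 2 / 8.0
    g = (n_dot_l / (n_dot_l * (1.0 - k) + k + eps)) * \
        (n_dot_v / (n_dot_v * (1.0 - k) + k + eps))

    # Schlick Fresnel, with F0 blended between dielectric 0.04 and albedo.
    f0 = 0.04 * (1.0 - m) + albedo * m
    f = f0 + (1.0 - f0) * (1.0 - h_dot_v) ** 5

    specular = d * g * f / (4.0 * n_dot_l * n_dot_v + eps)
    diffuse = (1.0 - f) * (1.0 - m) * albedo / np.pi
    return (diffuse + specular) * np.asarray(light_color) * n_dot_l

# Example: relight flat 64x64 maps under a light from the upper-left.
H = W = 64
img = relight(albedo=np.full((H, W, 3), 0.6),
              roughness=np.full((H, W), 0.4),
              metallic=np.zeros((H, W)),
              normal=np.tile([0.0, 0.0, 1.0], (H, W, 1)),
              light_dir=[-1.0, 1.0, 1.0],
              view_dir=[0.0, 0.0, 1.0])
```

Because the predicted maps are multi-view consistent, the same shading can be applied per generated view, which is what makes the outputs directly usable for relightable 3D asset creation.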