Traditional monocular depth estimation suffers from inherent ambiguity and visual nuisances. We demonstrate that language can enhance monocular depth estimation by providing an additional condition, beyond images alone, that is aligned with plausible 3D scenes, thereby reducing the solution space of depth estimation. This conditional distribution is learned during the text-to-image pre-training of diffusion models. To generate images under various viewpoints and layouts that precisely reflect textual descriptions, the model implicitly captures object sizes, shapes, and scales, their spatial relationships, and the overall scene structure. In this paper, we present Iris, a strategy for integrating text descriptions into the training and inference of diffusion-based depth estimation models, and investigate its benefits. We experiment with three diffusion-based monocular depth estimators (Marigold, Lotus, and E2E-FT) and their variants. Training on HyperSim and Virtual KITTI and evaluating on NYUv2, KITTI, ETH3D, ScanNet, and DIODE, we find that our strategy improves overall monocular depth estimation accuracy, especially in small regions. It also improves the model's depth perception of specific regions described in the text. Providing more detailed text allows the depth prediction to be iteratively refined. We further find that language acts as a constraint that accelerates the convergence of both training and the inference-time diffusion trajectory. Code and generated text data will be released upon acceptance.