Despite recent progress in using Large Language Models (LLMs) to automatically generate 3D scenes, the generated scenes often lack the realistic spatial layouts and object attributes found in real-world environments. Because this problem stems from coarse-grained, insufficiently detailed instructions, advancing 3D scene synthesis guided by fine-grained instructions that reflect real-world environments becomes crucial. Training embodied agents in such unrealistic scenes can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Verifying the alignment between fine-grained instructions and generated scenes is therefore essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to assess this alignment reliably. This shortcoming arises primarily from their shallow understanding of 3D scenes, which frequently leaves scene components improperly grounded. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessment. We also present LEGO-Bench, a benchmark of detailed instructions that specify the complex layouts and attributes of real-world environments. Experiments show that LEGO-Eval outperforms VLM-as-a-judge by 0.41 in F1 score when assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods: across all evaluated approaches, success rates reached at most 10% for generating scenes that fully align with fine-grained instructions.