Verifiable geometric reasoning is a critical component of trustworthy, controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. Across six simple tasks with binary and continuous targets, overall accuracy against 3D ground truth is modest: ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail to recognize minority shape classes (equilateral, isosceles, and right-angled triangles), with accuracy falling to ~0%. Overall accuracy further degrades by ~4.1% under camera tilt. Together, these results show that the models do not exploit the explicit frame-of-reference hint in the prompt and instead default to 2D image-plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
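The rectification that the prompt's square-border guardrail makes possible can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the function names, the pure-NumPy Direct Linear Transform, and the stand-in tilt homography `H_tilt` are all assumptions for the example. Given the four observed corners of the known square border, a model (or a verifier) can estimate the homography back to the canonical square and un-warp the triangle's vertices before reasoning about its geometry.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate the 3x3 homography H mapping src -> dst (both Nx2, N >= 4)
    via the Direct Linear Transform; exact for 4 noise-free correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of the stacked constraint matrix.
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Apply H to an Nx2 array of points, with perspective division."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Canonical square border (the frame of reference described in the prompt)
# and its corners as observed under a hypothetical camera tilt.
square = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
H_tilt = np.array([[1.0,   0.2,   0.0],   # stand-in for the true camera map
                   [0.1,   1.1,   0.0],
                   [0.001, 0.002, 1.0]])
observed = apply_homography(H_tilt, square)

# Rectify: map the observed border corners back onto the canonical square,
# then un-warp the triangle vertices seen in the image plane.
H_rect = homography_dlt(observed, square)
tri_img = apply_homography(H_tilt, np.array([[0.2, 0.3], [0.7, 0.25], [0.5, 0.8]]))
tri_rect = apply_homography(H_rect, tri_img)
```

The point of the guardrail is exactly this: because the border's true shape is stated in the prompt, the 2D image-plane cues are recoverable into plane coordinates, so a model that used the hint would not drift toward 2D projections under tilt.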