Vision-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: object layout, camera viewpoint, robot initial state, language instructions, lighting conditions, background textures, and sensor noise. We comprehensively analyze multiple state-of-the-art models and reveal consistent brittleness beneath their apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to factors such as camera viewpoint and robot initial state, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations; further experiments reveal that they tend to ignore language instructions entirely. Our findings challenge the assumption that high benchmark scores equate to genuine competence and highlight the need for evaluation practices that assess reliability under realistic variation.
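To make the evaluation protocol concrete, below is a minimal sketch of the kind of perturbation sweep the abstract describes. It assumes a hypothetical harness: `make_episode` applies one controlled perturbation at a given strength and `rollout` executes the policy once. Both functions, the factor names, and the toy success model are illustrative stand-ins, not the paper's actual code.

```python
# Illustrative perturbation sweep over the seven factors named above.
# All names here (make_episode, rollout, the decay constant) are
# hypothetical placeholders for a real benchmark harness.
import random
from statistics import mean

FACTORS = [
    "object_layout", "camera_viewpoint", "robot_initial_state",
    "language_instruction", "lighting", "background_texture", "sensor_noise",
]

def make_episode(factor: str, level: float) -> dict:
    # Placeholder: a real harness would configure the simulator/scene here.
    return {"factor": factor, "level": level}

def rollout(model, episode: dict) -> bool:
    # Placeholder success model: success probability decays with perturbation
    # strength. Stands in for actually executing the policy in simulation.
    return random.random() > 0.7 * episode["level"]

def success_rate(model, factor: str, level: float, n_trials: int = 50) -> float:
    """Fraction of successful rollouts for one factor at one strength."""
    return mean(rollout(model, make_episode(factor, level)) for _ in range(n_trials))

def vulnerability_profile(model, levels=(0.0, 0.5, 1.0)) -> dict:
    """Sweep every factor over perturbation strengths; a steep drop from the
    unperturbed baseline (level 0.0) flags brittleness along that axis."""
    return {f: [success_rate(model, f, lv) for lv in levels] for f in FACTORS}

if __name__ == "__main__":
    for factor, rates in vulnerability_profile(model=None).items():
        print(f"{factor:22s} " + " ".join(f"{r:.2f}" for r in rates))
```

Reporting per-factor success curves against an unperturbed baseline, rather than a single aggregate score, is what exposes the camera-viewpoint and initial-state brittleness (95% dropping below 30%) that a headline benchmark number hides.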