Trustworthy robot behavior requires not only high task success rates but also the ability to reliably quantify how likely the robot is to succeed. To this end, we present a first-of-its-kind study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural language instructions to low-level robot motor commands. We establish a confidence baseline for VLAs, examine how task success relates to calibration error and how calibration evolves over time, and introduce two lightweight techniques to remedy the miscalibration we observe: prompt ensembles and action-wise Platt scaling. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.
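The abstract only names these two remedies, so the following is a minimal sketch of what such post-hoc calibration can look like, not the authors' implementation. It assumes scalar per-episode success confidences and binary task outcomes; all function names and the "action-wise" decomposition details are hypothetical. A prompt ensemble averages the model's confidence across paraphrased instructions, and Platt scaling fits a two-parameter logistic map from raw confidences to calibrated success probabilities on held-out data:

```python
import numpy as np
from scipy.optimize import minimize

EPS = 1e-6  # numerical guard for logit/log computations


def prompt_ensemble_confidence(conf_per_prompt):
    """Average success confidence over paraphrased prompts.

    conf_per_prompt: array of shape (n_prompts, n_episodes);
    hypothetical illustration, not the paper's exact aggregation.
    """
    return np.mean(conf_per_prompt, axis=0)


def _logit(c):
    c = np.clip(c, EPS, 1.0 - EPS)
    return np.log(c / (1.0 - c))


def fit_platt(confidences, successes):
    """Fit Platt parameters (a, b) by minimizing the negative
    log-likelihood of sigmoid(a * logit(c) + b) against binary
    task outcomes on a held-out calibration set."""
    logits = _logit(np.asarray(confidences, dtype=float))
    y = np.asarray(successes, dtype=float)

    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-(a * logits + b)))
        p = np.clip(p, EPS, 1.0 - EPS)
        return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    res = minimize(nll, x0=[1.0, 0.0], method="Nelder-Mead")
    return res.x  # (a, b)


def apply_platt(confidences, a, b):
    """Map raw confidences to calibrated success probabilities."""
    return 1.0 / (1.0 + np.exp(-(a * _logit(np.asarray(confidences, dtype=float)) + b)))


if __name__ == "__main__":
    # Synthetic demo: raw scores are systematically overconfident.
    rng = np.random.default_rng(0)
    raw = rng.uniform(0.2, 0.95, size=200)
    success = (rng.uniform(size=200) < 0.7 * raw).astype(float)
    a, b = fit_platt(raw, success)
    calibrated = apply_platt(raw, a, b)
    print(f"a={a:.2f}, b={b:.2f}, mean raw={raw.mean():.2f}, mean calibrated={calibrated.mean():.2f}")
```

Fitting only two scalars per action (or per model) keeps the correction lightweight and hard to overfit, which is consistent with the abstract's framing of these techniques as lightweight remedies.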