Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce Visual Action Reasoning, a new task, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive/reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that, while frontier models such as GPT-4o perform relatively well, a substantial gap to human-level reasoning remains, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making. VisualActBench thus provides a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
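To make the annotation scheme concrete, the following is a minimal sketch of what a VisualActBench-style action record and an APL-weighted scoring rule could look like. All field names (video_id, apl, action_type) and the weighting scheme are illustrative assumptions, not the benchmark's actual schema or metric.

```python
# Hypothetical sketch of an annotation record and a toy APL-weighted recall.
# Field names and the scoring rule are assumptions for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedAction:
    video_id: str      # source video from one of the four scenarios
    description: str   # human-written action, e.g. "turn off the stove"
    apl: int           # Action Prioritization Level (assume 1 = highest priority)
    action_type: str   # "proactive" or "reactive"


def apl_weighted_recall(predicted: List[str], gold: List[AnnotatedAction]) -> float:
    """Toy metric: recall over gold actions, weighting high-priority actions more."""
    if not gold:
        return 0.0
    weights = [1.0 / a.apl for a in gold]  # higher priority -> larger weight
    hits = [w for a, w in zip(gold, weights)
            if any(a.description.lower() in p.lower() for p in predicted)]
    return sum(hits) / sum(weights)


gold = [
    AnnotatedAction("kitchen_001", "turn off the stove", apl=1, action_type="proactive"),
    AnnotatedAction("kitchen_001", "open a window", apl=3, action_type="reactive"),
]
print(apl_weighted_recall(["I would turn off the stove immediately."], gold))  # 0.75
```

Under this hypothetical weighting, missing a high-priority (low-APL) action costs the model more than missing a low-priority one, which mirrors the paper's emphasis on proactive, high-priority actions.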