The growing capabilities of large language models (LLMs) in instruction following and context understanding have ushered in an era of agents with numerous applications. Among these, task planning agents have become especially prominent in realistic scenarios involving complex internal pipelines, such as context understanding, tool management, and response generation. However, existing benchmarks predominantly evaluate agent performance using task completion as a proxy for overall effectiveness. We hypothesize that merely improving task completion is misaligned with maximizing user satisfaction, as users interact with the entire agentic process rather than only the end result. To address this gap, we propose AURA, an Agent-User inteRaction Assessment framework that conceptualizes the behavioral stages of interactive task planning agents. AURA offers a comprehensive assessment of agents through a set of atomic LLM evaluation criteria, allowing researchers and practitioners to diagnose specific strengths and weaknesses within an agent's decision-making pipeline. Our analyses show that agents excel at different behavioral stages, and that user satisfaction is shaped by both outcomes and intermediate behaviors. We also highlight future directions, including systems that leverage multiple agents and the limitations of user simulators in task planning.
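To make the idea of stage-wise assessment with atomic LLM criteria concrete, the following is a minimal illustrative sketch, not the AURA implementation: it scores an agent trajectory by asking an LLM judge a set of atomic yes/no questions grouped by behavioral stage and aggregating per stage. The stage names, criteria, and the `judge` callable are hypothetical placeholders.

```python
# Minimal sketch (assumptions: stage names, criteria, and judge are hypothetical).
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    stage: str   # hypothetical behavioral stage, e.g. "tool_management"
    name: str    # atomic criterion, e.g. "selects an appropriate tool"
    prompt: str  # instruction given to the LLM judge


def evaluate(trajectory: str,
             criteria: List[Criterion],
             judge: Callable[[str], bool]) -> Dict[str, float]:
    """Return per-stage scores: the fraction of atomic criteria the judge marks satisfied."""
    per_stage: Dict[str, List[bool]] = {}
    for c in criteria:
        verdict = judge(f"{c.prompt}\n\nAgent trajectory:\n{trajectory}")
        per_stage.setdefault(c.stage, []).append(verdict)
    return {stage: sum(v) / len(v) for stage, v in per_stage.items()}


if __name__ == "__main__":
    # Toy judge that keyword-matches; a real judge would call an LLM.
    toy_judge = lambda prompt: "tool" in prompt.lower()
    criteria = [
        Criterion("tool_management", "selects an appropriate tool",
                  "Did the agent call a tool suited to the user's request? Answer yes or no."),
        Criterion("response_generation", "grounds the answer in tool output",
                  "Is the final response grounded in the returned tool output? Answer yes or no."),
    ]
    print(evaluate("The agent called the weather tool and summarized its output.",
                   criteria, toy_judge))
```

Keeping each criterion atomic (one judgment per prompt) is what allows failures to be localized to a specific stage of the agent's pipeline rather than folded into a single end-to-end score.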