We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Existing methods suffer from three key limitations: insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics. OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state-of-the-art LLMs reveal distinct trade-offs between expert-level task completion and interaction fluency, offering practical insights for building reliable, human-like outbound AI systems. OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional applications.