While Large Language Models (LLMs) have achieved remarkable success in cognitive and reasoning benchmarks, they exhibit a persistent deficit in anthropomorphic intelligence, the capacity to navigate complex social, emotional, and ethical nuances. This gap is particularly acute in the Chinese linguistic and cultural context, where the lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress. To address these limitations, we present HeartBench, a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese LLMs. Grounded in authentic psychological counseling scenarios and developed in collaboration with clinical experts, the benchmark is structured around a theory-driven taxonomy comprising five primary dimensions and fifteen secondary capabilities. We implement a case-specific, rubric-based methodology that translates abstract human-like traits into granular, measurable criteria through a ``reasoning-before-scoring'' evaluation protocol. Our assessment of 13 state-of-the-art LLMs indicates a substantial performance ceiling: even leading models achieve only 60% of the expert-defined ideal score. Furthermore, analysis on a difficulty-stratified ``Hard Set'' reveals significant performance degradation in scenarios involving subtle emotional subtexts and complex ethical trade-offs. HeartBench establishes a standardized metric for anthropomorphic AI evaluation and provides a methodological blueprint for constructing high-quality, human-aligned training data.