A critical gap exists in task-specific benchmarks for large language models (LLMs). Thermal comfort, a sophisticated interplay of environmental factors and personal perception that involves sensory integration and adaptive decision-making, serves as an ideal paradigm for evaluating the real-world cognitive capabilities of AI systems. To address this gap, we propose TCEval, the first evaluation framework that assesses three core cognitive capacities of AI (cross-modal reasoning, causal association, and adaptive decision-making) by leveraging thermal comfort scenarios and LLM agents. The methodology involves initializing LLM agents with virtual personality attributes, guiding them to generate clothing insulation selections and thermal comfort feedback, and validating the outputs against the ASHRAE Global Database and the Chinese Thermal Comfort Database. Experiments on four LLMs show that, although agent feedback exhibits limited exact alignment with human responses, directional consistency improves markedly within a 1 PMV tolerance. Statistical tests reveal that LLM-generated PMV distributions diverge significantly from the human data, and the agents perform near randomly on discrete thermal comfort classification. These results confirm the feasibility of TCEval as an ecologically valid Cognitive Turing Test for AI, demonstrating that current LLMs possess foundational cross-modal reasoning ability but lack a precise causal understanding of the nonlinear relationships among thermal comfort variables. TCEval complements traditional benchmarks by shifting the focus of AI evaluation from abstract task proficiency to embodied, context-aware perception and decision-making, offering valuable insights for advancing AI in human-centric applications such as smart buildings.
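Because both the tolerance analysis and the distribution tests rely on the PMV index, a minimal sketch of the standard Fanger PMV calculation (ISO 7730) is included below to illustrate the nonlinear coupling among the six thermal comfort variables referenced above. The function name, defaults, and example values are illustrative assumptions; this is not the paper's evaluation code.

```python
import math

def pmv_fanger(ta, tr, vel, rh, met, clo, wme=0.0):
    """Predicted Mean Vote per Fanger's model (ISO 7730).
    ta/tr: air / mean radiant temperature [degC], vel: air speed [m/s],
    rh: relative humidity [%], met: metabolic rate [met],
    clo: clothing insulation [clo], wme: external work [met]."""
    pa = rh * 10.0 * math.exp(16.6536 - 4030.183 / (ta + 235.0))  # vapour pressure [Pa]
    icl = 0.155 * clo                 # clothing insulation [m2.K/W]
    m = met * 58.15                   # metabolic rate [W/m2]
    mw = m - wme * 58.15              # internal heat production
    fcl = 1.0 + 1.29 * icl if icl <= 0.078 else 1.05 + 0.645 * icl
    hcf = 12.1 * math.sqrt(vel)       # forced-convection coefficient
    taa, tra = ta + 273.0, tr + 273.0

    # Iteratively solve the clothing-surface heat balance.
    tcla = taa + (35.5 - ta) / (3.5 * icl + 0.1)
    p1 = icl * fcl
    p2 = p1 * 3.96
    p3 = p1 * 100.0
    p4 = p1 * taa
    p5 = 308.7 - 0.028 * mw + p2 * (tra / 100.0) ** 4
    xn, xf = tcla / 100.0, tcla / 50.0
    for _ in range(150):
        if abs(xn - xf) <= 0.00015:
            break
        xf = (xf + xn) / 2.0
        hcn = 2.38 * abs(100.0 * xf - taa) ** 0.25  # natural convection
        hc = max(hcf, hcn)
        xn = (p5 + p4 * hc - p2 * xf ** 4) / (100.0 + p3 * hc)
    tcl = 100.0 * xn - 273.0

    # Heat-loss terms: skin diffusion, sweating, latent and dry
    # respiration, radiation, and convection.
    hl1 = 3.05e-3 * (5733.0 - 6.99 * mw - pa)
    hl2 = 0.42 * (mw - 58.15) if mw > 58.15 else 0.0
    hl3 = 1.7e-5 * m * (5867.0 - pa)
    hl4 = 0.0014 * m * (34.0 - ta)
    hl5 = 3.96 * fcl * (xn ** 4 - (tra / 100.0) ** 4)
    hl6 = fcl * hc * (tcl - ta)
    ts = 0.303 * math.exp(-0.036 * m) + 0.028  # sensation transfer coefficient
    return ts * (mw - hl1 - hl2 - hl3 - hl4 - hl5 - hl6)

# Example: a typical office condition; PMV responds nonlinearly to the
# clothing insulation value that the LLM agents are asked to select.
print(round(pmv_fanger(ta=25, tr=25, vel=0.1, rh=50, met=1.2, clo=0.5), 2))
```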