Course evaluation plays a critical role in ensuring instructional quality and guiding curriculum development in higher education. However, traditional evaluation methods, such as student surveys, classroom observations, and expert reviews, are often constrained by subjectivity, high labor costs, and limited scalability. Recent advances in large language models (LLMs) open new opportunities for generating consistent, fine-grained, and scalable course evaluations. This study investigates the use of three representative LLMs for automated course evaluation at both the micro level (classroom discussion analysis) and the macro level (holistic course review). Using classroom interaction transcripts and a dataset of 100 courses from a major institution in China, we demonstrate that LLMs can extract key pedagogical features and generate structured evaluation results aligned with expert judgement. A fine-tuned version of Llama shows superior reliability, producing score distributions with greater differentiation and stronger correlation with human evaluators than the other two models. The results highlight three major findings: (1) LLMs can reliably perform systematic and interpretable course evaluations at both the micro and macro levels; (2) fine-tuning and prompt engineering significantly enhance evaluation accuracy and consistency; and (3) LLM-generated feedback provides actionable insights for teaching improvement. These findings illustrate the promise of LLM-based evaluation as a practical tool for supporting quality assurance and educational decision-making in large-scale higher education settings.