LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes, along with project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o's scores correlate strongly with those of the human graders (correlation up to 0.98) and match the human score exactly in 55% of quiz cases. For project reports, GPT-4o also shows strong overall alignment with human grading, while exhibiting some variability when scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and the limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
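The two agreement measures reported above, correlation and exact score agreement, can be illustrated with a minimal sketch. It assumes the correlation is Pearson's r, which the abstract does not specify. The function names `pearson` and `exact_agreement` and the score lists are hypothetical, made up for illustration, and are not taken from the released code or data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items on which both graders gave the identical score."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Illustrative scores only (NOT data from the study): one entry per student answer.
human_scores = [3, 2, 5, 4, 1, 4]
llm_scores = [3, 3, 5, 4, 1, 3]

print(f"Pearson r:       {pearson(human_scores, llm_scores):.3f}")
print(f"Exact agreement: {exact_agreement(human_scores, llm_scores):.1%}")
```

The two measures capture different things: correlation can stay high when the LLM is consistently off by a point, whereas exact agreement counts only identical scores. That is one way a correlation as high as 0.98 and a 55% exact-match rate can coexist.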
