Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes, along with project reports from 14 teams. The LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55\% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability when scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and the limitations of LLM-based grading and contributes to advancing automated assessment in real-world academic settings.