Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings on their reliability relative to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies, dated January 2022 through August 2025, that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho) mostly falling between 0.30 and 0.80. Substantial variability in agreement was observed across studies, reflecting differences in study-specific factors as well as a lack of standardized reporting practices. Implications and directions for future research are discussed.
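For readers unfamiliar with the agreement indices named above, the sketch below illustrates how Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho can be computed for a pair of score vectors; the score arrays are hypothetical and are not drawn from any reviewed study.

```python
# Minimal sketch of the agreement indices cited in the abstract,
# computed on hypothetical human and LLM scores for the same essays.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr, spearmanr

human = [3, 4, 2, 5, 3, 1, 4, 2]   # hypothetical human rater scores (0-5 rubric)
llm   = [3, 5, 2, 4, 3, 2, 4, 3]   # hypothetical LLM scores for the same essays

qwk = cohen_kappa_score(human, llm, weights="quadratic")  # Quadratic Weighted Kappa
r, _ = pearsonr(human, llm)                               # Pearson correlation
rho, _ = spearmanr(human, llm)                            # Spearman's rho

print(f"QWK={qwk:.2f}, Pearson r={r:.2f}, Spearman rho={rho:.2f}")
```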