While rapid advances in large language models (LLMs) are reshaping data-driven intelligent education, accurately simulating students remains an important yet challenging bottleneck for scalable educational data collection, evaluation, and intervention design. Current work is limited by scarce real interaction data, the high cost of expert evaluation of realism, and the lack of large-scale, systematic analyses of LLMs' ability to simulate students. We address this gap with a three-stage LLM-human collaborative pipeline that automatically generates and filters high-quality student agents. We leverage two-round automated scoring validated by human experts and deploy a score propagation module to obtain more consistent scores across a student similarity graph. Experiments show that combining automated scoring, expert calibration, and graph-based propagation yields simulated students that more closely track human judgments of authenticity. We further analyze which profiles and behaviors are simulated more faithfully, supporting subsequent studies on personalized learning and educational assessment.
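To make the score propagation step concrete, the following is a minimal sketch of one plausible reading: a label-propagation-style update that iteratively blends each student agent's automated score with the similarity-weighted scores of its graph neighbors. The function name, the update rule, and all parameters (`alpha`, `iters`) are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def propagate_scores(S, scores, alpha=0.8, iters=50):
    """Hypothetical graph-based score propagation (not the paper's exact rule).

    S      : (n, n) symmetric student-similarity matrix with positive entries.
    scores : (n,) initial automated realism scores.
    alpha  : how much weight neighbors get relative to the original score.
    """
    # Row-normalize the similarity matrix so each row defines a
    # weighted average over a student's neighbors.
    W = S / S.sum(axis=1, keepdims=True)
    f = scores.copy()
    for _ in range(iters):
        # Blend propagated neighbor scores with the original automated scores,
        # so similar students converge toward consistent scores.
        f = alpha * (W @ f) + (1 - alpha) * scores
    return f

# Toy example: three student agents, two of them highly similar.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
scores = np.array([0.8, 0.4, 0.6])
print(propagate_scores(S, scores))  # scores of the similar pair move closer
```

Under this assumed update, the fixed point is a smoothed version of the initial scores, which matches the abstract's stated goal of obtaining more consistent scores across similar students.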