The rapid rise of large language model (LLM)-based tutors in K--12 education has fostered a misconception that generative models can replace traditional learner modelling for adaptive instruction. This is especially problematic because the EU AI Act classifies K--12 education as a high-risk domain requiring responsible design. Motivated by these concerns, this study synthesises evidence on the limitations of LLM-based tutors and empirically investigates one critical issue: the accuracy, reliability, and temporal coherence of assessing learners' evolving knowledge over time. We compare a deep knowledge tracing (DKT) model with a widely used LLM, evaluated both zero-shot and fine-tuned, on a large open-access dataset. Results show that DKT achieves the highest discrimination performance (AUC = 0.83) on next-step correctness prediction and consistently outperforms the LLM across all settings. Although fine-tuning improves the LLM's AUC by approximately 8\% over the zero-shot baseline, it remains 6\% below DKT and produces larger early-sequence errors, where incorrect predictions are most harmful for adaptive support. Temporal analyses further reveal that DKT maintains stable, directionally correct mastery updates, whereas the LLM variants exhibit substantial temporal weaknesses, including inconsistent and wrong-direction updates. These limitations persist despite the fine-tuned LLM requiring nearly 198 hours of high-compute training, far exceeding the computational demands of DKT. Our qualitative analysis of multi-skill mastery estimation further shows that, even after fine-tuning, the LLM produced inconsistent mastery trajectories, while DKT maintained smooth, coherent updates. Overall, the findings suggest that LLMs alone are unlikely to match the effectiveness of established intelligent tutoring systems, and that responsible tutoring requires hybrid frameworks that incorporate learner modelling.