With generative artificial intelligence driving the growth of dialogic data in education, automated coding is a promising direction for learning analytics to improve efficiency. This surge highlights the need to understand the nuances of student-AI interactions, especially those that are rare yet crucial. However, automated coding may struggle to capture these rare codes due to imbalanced data, while human coding remains time-consuming and labour-intensive. The current study examined the potential of large language models (LLMs) to approximate or replace humans in deductive, theory-driven coding, while also exploring how human-AI collaboration might support such coding tasks at scale. We compared the coding performance of small transformer classifiers (e.g., BERT) and LLMs on two datasets, with particular attention to imbalanced head-tail distributions in dialogue codes. Our results showed that LLMs did not outperform BERT-based models and exhibited systematic errors and biases in deductive coding tasks. We designed and evaluated a human-AI collaborative workflow that improved coding efficiency while maintaining coding reliability. Our findings reveal both the limitations of LLMs (especially their difficulties with semantic similarity and theoretical interpretation) and the indispensable role of human judgment, while demonstrating the practical promise of human-AI collaborative workflows for coding.