When LLMs Fall Short in Deductive Coding: Model Comparison and Human-AI Collaboration Workflow Design

With generative artificial intelligence driving the growth of dialogic data in education, automated coding is a promising way for learning analytics to improve efficiency. This surge highlights the need to understand the nuances of student-AI interactions, especially those that are rare yet crucial. However, automated coding may struggle to capture these rare codes because of imbalanced data, while human coding remains time-consuming and labour-intensive. The current study examined the potential of large language models (LLMs) to approximate or replace human coders in deductive, theory-driven coding, and explored how human-AI collaboration might support such coding tasks at scale. We compared the coding performance of small transformer classifiers (e.g., BERT) and LLMs on two datasets, with particular attention to imbalanced head-tail distributions of dialogue codes. Our results showed that LLMs did not outperform BERT-based models and exhibited systematic errors and biases in deductive coding tasks. We then designed and evaluated a human-AI collaborative workflow that improved coding efficiency while maintaining coding reliability. Our findings reveal both the limitations of LLMs, especially their difficulties with semantic similarity and theoretical interpretation, and the indispensable role of human judgment, while demonstrating the practical promise of human-AI collaborative workflows for coding.
