In recent years, natural language processing (NLP) has become integral to educational data mining, particularly in the analysis of student-generated language products. For research and assessment purposes, so-called embedding models are typically employed to generate numeric representations of text that capture its semantic content for use in subsequent quantitative analyses. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing research studies and practical applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased research findings and diminished performance of practical applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: 1) similarity-based analyses and 2) integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI's GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Overall, this study underscores the importance for educational data mining researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions. The code and (partial) data are available at https://doi.org/10.17605/OSF.IO/6XQVG.
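To make the first evaluation approach concrete, the following is a minimal sketch of a similarity-based analysis: two notational variants of the same expression should receive a higher cosine similarity than two physically different expressions. The function `toy_embed` is a hypothetical stand-in based on character bigram counts, not one of the embedding models examined in this study, and the example expressions are illustrative rather than drawn from the student data.

```python
import math
from collections import Counter


def toy_embed(text: str, n: int = 2) -> Counter:
    """Hypothetical stand-in for an embedding model: character n-gram counts.

    A real analysis would call an actual embedding model here instead.
    """
    cleaned = "".join(text.split())  # drop whitespace so "F = m*a" equals "F=m*a"
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_u * norm_v)


# Illustrative expressions (assumed for this sketch, not taken from the data).
reference = toy_embed("F = m*a")
same_meaning = toy_embed("F=m*a")      # same expression, different spacing
different = toy_embed("E = m*c^2")     # a different physical relation

sim_same = cosine_similarity(reference, same_meaning)
sim_diff = cosine_similarity(reference, different)

print(f"same expression:      {sim_same:.3f}")
print(f"different expression: {sim_diff:.3f}")
```

An embedding model that handles symbolic expressions well should preserve this ordering for harder cases too, such as reordered terms or alternative symbols, which a surface-level stand-in like the one above cannot capture.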