Intelligent tutoring systems (ITSs) have demonstrated effectiveness in teaching formal propositional logic proofs, but their reliance on template-based explanations limits their ability to provide personalized feedback to students. While large language models (LLMs) offer promising capabilities for dynamic feedback generation, they risk producing hallucinations or pedagogically unsound explanations. We evaluated the stepwise accuracy of LLMs in constructing multi-step symbolic logic proofs, comparing six prompting techniques across four state-of-the-art LLMs on 358 propositional logic problems. Results show that DeepSeek-V3 achieved the best performance, reaching up to 86.7% stepwise proof-construction accuracy and excelling particularly on simpler inference rules. We then used the best-performing LLM to generate explanatory hints for 1,050 unique student problem-solving states drawn from a logic ITS, and evaluated the hints on four criteria using both an LLM grader and human expert ratings on a 20% sample. Our analysis finds that LLM-generated hints were 75% accurate and were rated highly by human evaluators on consistency and clarity, but scored lower on explaining why a hint was given and situating it in the larger proof context. These results demonstrate that LLMs may be used to augment tutoring systems with logic tutoring hints, but that additional modifications are needed to ensure accuracy and pedagogical appropriateness.