How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports on a first attempt at an AI teacher test. We built a solution around the insight that you can run conversational agents in parallel to human teachers in real-world dialogues, simulate how different agents would respond to a student, and compare these counterpart responses in terms of three abilities: speak like a teacher, understand a student, help a student. Our method builds on the reliability of comparative judgments in education and uses a probabilistic model and Bayesian sampling to infer estimates of pedagogical ability. We find that, even though conversational agents (Blender in particular) perform well on conversational uptake, they are quantifiably worse than real teachers on several pedagogical dimensions, especially with regard to helpfulness (Blender: Δ ability = -0.75; GPT-3: Δ ability = -0.93).
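The abstract names the method only in outline: a probabilistic model over comparative judgments, with Bayesian sampling used to infer ability estimates. The sketch below is a minimal illustration of that idea, not the paper's exact model: it fits a Bradley-Terry-style comparative-judgment model with a random-walk Metropolis sampler, where each respondent (human teacher or agent) has a latent ability and the probability that one response is preferred over another depends on the ability difference. The toy comparison data, respondent labels, priors, and sampler settings are all assumptions for illustration.

```python
# Hedged sketch (assumed details, not the authors' exact model): a
# Bradley-Terry-style comparative-judgment model sampled with Metropolis.
# Each respondent gets a latent ability; P(a preferred over b) = sigmoid(diff).
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise judgments: (index_a, index_b, a_won), with illustrative labels
# 0 = human teacher, 1 = Blender, 2 = GPT-3.
comparisons = [(0, 1, 1), (0, 1, 1), (0, 2, 1), (0, 2, 1), (1, 2, 1), (0, 1, 0)]
n_respondents = 3

def log_posterior(ability):
    """Standard-normal prior on abilities plus Bradley-Terry likelihood."""
    logp = -0.5 * np.sum(ability ** 2)            # N(0, 1) prior
    for a, b, a_won in comparisons:
        p_a = 1.0 / (1.0 + np.exp(-(ability[a] - ability[b])))
        logp += np.log(p_a if a_won else 1.0 - p_a)
    return logp

# Random-walk Metropolis over the ability vector.
ability = np.zeros(n_respondents)
current_lp = log_posterior(ability)
samples = []
for step in range(20_000):
    proposal = ability + rng.normal(scale=0.3, size=n_respondents)
    proposal_lp = log_posterior(proposal)
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        ability, current_lp = proposal, proposal_lp
    if step >= 5_000:                              # discard burn-in
        samples.append(ability.copy())

posterior_mean = np.array(samples).mean(axis=0)
# Δ ability of each agent relative to the human teacher (index 0).
print("Δ ability vs. teacher:", posterior_mean[1:] - posterior_mean[0])
```

With comparison data in which the human teacher is preferred most of the time, the agents' posterior mean abilities come out below the teacher's, which is the sense in which the reported Δ ability values are negative.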