A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulty of translating pedagogical intuitions into gen AI prompts and to the lack of good evaluation practices, both compounded by the challenge of defining excellent pedagogy. Here we present our work collaborating with learners and educators to translate high-level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic and human evaluations, and to develop a new set of fine-tuning datasets that improve the pedagogical capabilities of Gemini, introducing LearnLM-Tutor. Our evaluations show that educators and learners consistently prefer LearnLM-Tutor over a prompt-tuned Gemini on a number of pedagogical dimensions. We hope that this work can serve as a first step towards a comprehensive educational evaluation framework, and that it can enable rapid progress within the AI and EdTech communities towards maximising the positive impact of gen AI in education.