Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

Irina Jurenka, Markus Kunesch, Kevin R. McKee, Daniel Gillick, Shaojian Zhu, Sara Wiltberger, Shubham Milind Phal, Katherine Hermann, Daniel Kasenberg, Avishkar Bhoopchand, Ankit Anand, Miruna Pîslar, Stephanie Chan, Lisa Wang, Jennifer She, Parsa Mahmoudieh, Aliya Rysbek, Wei-Jen Ko, Andrea Huber, Brett Wiltshire, Gal Elidan, Roni Rabin, Jasmin Rubinovitz, Amit Pitaru, Mac McAllister, Julia Wilkowski, David Choi, Roee Engelberg, Lidan Hackmon, Adva Levin, Rachel Griffin, Michael Sears, Filip Bar, Mia Mesar, Mana Jabbour, Arslan Chaudhry, James Cohan, Sridhar Thiagarajan, Nir Levine, Ben Brown, Dilan Gorur, Svetlana Grant, Rachel Hashimshoni, Laura Weidinger, Jieru Hu, Dawn Chen, Kuba Dolecki, Canfer Akbulut, Maxwell Bileschi, Laura Culp, Wen-Xin Dong, Nahema Marchal, Kelsie Van Deman, Hema Bajaj Misra, Michael Duah, Moran Ambar, Avi Caciularu, Sandra Lefdal, Chris Summerfield, James An, Pierre-Alexandre Kamienny, Abhinit Mohdi, Theofilos Strinopoulous, Annie Hale, Wayne Anderson, Luis C. Cobo, Niv Efron, Muktha Ananda, Shakir Mohamed, Maureen Heymans, Zoubin Ghahramani, Yossi Matias, Ben Gomes, Lila Ibrahim

A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulties with verbalising pedagogical intuitions into gen AI prompts and the lack of good evaluation practices, reinforced by the challenges in defining excellent pedagogy. Here we present our work collaborating with learners and educators to translate high-level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic and human evaluations; and to develop a new set of fine-tuning datasets to improve the pedagogical capabilities of Gemini, introducing LearnLM-Tutor. Our evaluations show that LearnLM-Tutor is consistently preferred over a prompt-tuned Gemini by educators and learners on a number of pedagogical dimensions. We hope that this work can serve as a first step towards developing a comprehensive educational evaluation framework, and that this can enable rapid progress within the AI and EdTech communities towards maximising the positive impact of gen AI in education.
