AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs

The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Leveraging this suite, we develop Morpheus, the first baseline model collection for anesthesiology reasoning. Despite undergoing limited training with SFT and group relative policy optimization (GRPO), Morpheus demonstrates substantial performance improvements, rivaling the performance of larger-scale models. Furthermore, through comprehensive evaluations and experiments, we analyze the key factors influencing anesthesiology reasoning performance, including model characteristics, training strategies and training data. Both AnesSuite and Morpheus will be open-sourced at https://github.com/MiliLab/AnesSuite.
