While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the resulting reasoning process is often redundant and its computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose \textbf{Budget-Aware Reasoning Distillation (BARD)}, a novel framework that simultaneously distills reasoning capability and enables fine-grained control over reasoning length. BARD treats the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To realize this concept, BARD introduces a two-phase training regimen. The first phase performs Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, bootstrapping the model's understanding of budget constraints. The second phase applies Reinforcement Learning (RL) with a reward signal that jointly accounts for reasoning performance and budget fidelity. This two-phase regimen is crucial for avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (\textit{AIME24, AIME25, GPQA}) while providing precise and adaptive control over its reasoning length across a wide range of budgets.
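For concreteness, a minimal sketch of such a combined reward is given below; the functional form, the weights $\alpha$ and $\beta$, and the symbols $\ell$ (generated reasoning length) and $B$ (user-specified thinking budget) are illustrative assumptions rather than the exact design used by BARD:
\begin{equation*}
R(y,\ell \mid B) \;=\; \alpha\,\mathbb{1}\!\left[\,y \text{ is correct}\,\right] \;+\; \beta\,\exp\!\left(-\frac{|\ell - B|}{B}\right),
\end{equation*}
where the first term rewards reasoning performance on the final answer $y$ and the second rewards budget fidelity, decaying as the generated length $\ell$ drifts away from the budget $B$.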