Each Prompt Matters: Scaling Reinforcement Learning Without Wasting Rollouts on Hundred-Billion-Scale MoE

Anxiang Zeng, Haibo Zhang, Hailing Zhang, Kaixiang Mo, Liang Yao, Ling Hu, Long Zhang, Shuman Liu, Shuyi Xie, Yanshi Li, Yizhang Chen, Yuepeng Sheng, Yuwei Huang, Zhaochen Xu, Zhiqiang Zhou, Ziqin Liew

We present CompassMax-V3-Thinking, a hundred-billion-scale MoE reasoning model trained with a new RL framework built on one principle: each prompt must matter. Scaling RL to this size exposes four critical inefficiencies: zero-variance prompts that waste rollouts, unstable importance sampling over long horizons, advantage inversion from standard reward models, and systemic bottlenecks in rollout processing. To overcome these challenges, we introduce four innovations: (1) Multi-Stage Zero-Variance Elimination, which filters out non-informative prompts and stabilizes group-based policy optimization (e.g., GRPO) by removing wasted rollouts; (2) ESPO, an entropy-adaptive optimization method that balances token-level and sequence-level importance sampling to maintain stable learning dynamics; (3) a Router Replay strategy that aligns training-time MoE router decisions with inference-time behavior to mitigate train-infer discrepancies, coupled with a reward model adjustment to prevent advantage inversion; (4) a high-throughput RL system with FP8-precision rollouts, overlapped reward computation, and length-aware scheduling to eliminate performance bottlenecks. Together, these contributions form a cohesive pipeline that makes RL on hundred-billion-scale MoE models stable and efficient. The resulting model delivers strong performance across both internal and public evaluations.

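To illustrate why zero-variance prompts waste rollouts in group-based methods such as GRPO, the sketch below computes group-normalized advantages and drops prompts whose rollouts all received the same reward. It is a minimal single-stage illustration, not the paper's Multi-Stage Zero-Variance Elimination; the function names, the `eps` and `tol` values, and the toy rewards are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def drop_zero_variance(groups, tol=1e-8):
    """Keep only prompts whose rollout rewards are not all (near-)identical."""
    return {prompt: rs for prompt, rs in groups.items() if pstdev(rs) > tol}


# Toy rewards (hypothetical): four rollouts per prompt.
groups = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # every rollout succeeds
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # every rollout fails
    "prompt_c": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes
}

# With identical rewards every advantage is zero, so the group
# contributes no gradient even though its rollouts were paid for.
zero_signal = group_advantages(groups["prompt_a"])

# Only the mixed-outcome prompt survives filtering.
kept = drop_zero_variance(groups)
```

In this toy setting, `prompt_a` and `prompt_b` yield all-zero advantages, so only `prompt_c` carries a learning signal.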