Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. Unlike standard LLM post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive prefill phases, bandwidth-bound decoding, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires disaggregated infrastructure to leverage specialized, best-fit hardware. However, naive disaggregation introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages. We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. RollArc is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to best-fit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArc effectively improves training throughput and achieves a 1.35-2.05\(\times\) end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArc by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArc's scalability and robustness. The code is available at https://github.com/alibaba/ROLL.
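The hardware-affinity workload mapping described above can be illustrated with a minimal sketch. This is not RollArc's actual scheduler; the pool names, throughput figures, and the `route` helper are hypothetical, chosen only to show the principle that compute-bound prefill favors raw FLOPS while bandwidth-bound decoding favors memory bandwidth.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    kind: str  # "compute_bound" (e.g. prefill) or "bandwidth_bound" (e.g. decode)

@dataclass
class GpuPool:
    name: str
    flops_tflops: float   # hypothetical peak compute throughput
    mem_bw_gbps: float    # hypothetical peak memory bandwidth

def route(task: Task, pools: list[GpuPool]) -> GpuPool:
    """Hardware-affinity mapping: send each task to the pool whose
    dominant resource matches the task's bottleneck."""
    if task.kind == "compute_bound":
        return max(pools, key=lambda p: p.flops_tflops)
    return max(pools, key=lambda p: p.mem_bw_gbps)

# Two illustrative pools: one compute-optimized, one bandwidth-optimized.
pools = [
    GpuPool("compute_pool", flops_tflops=900.0, mem_bw_gbps=2000.0),
    GpuPool("bandwidth_pool", flops_tflops=150.0, mem_bw_gbps=3300.0),
]

print(route(Task("prefill", "compute_bound"), pools).name)    # compute_pool
print(route(Task("decode", "bandwidth_bound"), pools).name)   # bandwidth_pool
```

In a real disaggregated deployment the routing decision would also weigh queue depth, model-weight placement, and KV-cache transfer cost between pools; this sketch isolates only the affinity criterion.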