Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents inference and training from running concurrently. In this study, we return to the strategy of deploying inference and training separately and, by improving the data loader, transform the conventional synchronous architecture into a periodically asynchronous framework. This framework allows each component to scale independently and elastically on demand, while the algorithm remains exactly equivalent in accuracy to the synchronous method; both are on-policy. Notably, we adopt a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repetitive computation. In practice, these techniques achieve at least a threefold overall performance improvement in RL training on NPU platforms, indicating their potential for widespread application.
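To make the shared-prompt idea concrete, the sketch below shows one way such a mask could be constructed; it is an illustrative assumption rather than the paper's actual implementation, and the function name, shapes, and packing layout are our own. Several responses sampled for the same prompt (as in GRPO) are packed into one sequence, each response attends causally to the shared prompt and to its own earlier tokens, but never to sibling responses, so the prompt is encoded only once.

```python
# Minimal sketch of a shared-prompt attention mask (illustrative only).
# Packed layout assumed: [prompt, resp_1, ..., resp_G].
import torch

def shared_prompt_mask(prompt_len: int, resp_lens: list[int]) -> torch.Tensor:
    """Boolean mask where True means 'may attend'."""
    total = prompt_len + sum(resp_lens)
    # Start from a causal mask over the packed sequence.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    # Block attention between different responses.
    start = prompt_len
    for rl in resp_lens:
        end = start + rl
        # This response may see the prompt region [0, prompt_len)
        # and itself, but not earlier sibling responses.
        mask[start:end, prompt_len:start] = False
        start = end
    return mask

if __name__ == "__main__":
    # Example: a 3-token prompt shared by two 2-token responses.
    print(shared_prompt_mask(prompt_len=3, resp_lens=[2, 2]).int())
```

Under this assumed layout, the prompt's key/value states are computed a single time per group instead of once per response, which is the source of the reduced repetitive computation described above.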