Group Relative Policy Optimization (GRPO) is a promising policy-based approach to Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main issues in GRPO: (i) token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing the training process. To address these issues, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones, and which filters out completions whose entropy exceeds a provable threshold to prevent policy collapse. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, as validated through multiple experiments on GSM8K, MATH, AIME 2024, AIME 2025, and AMC 2023.
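To make the two mechanisms described above concrete, the sketch below illustrates one possible GTPO-style loss in PyTorch. It is not the authors' reference implementation: the function name `gtpo_style_loss`, the tensor layout, the conflict-detection rule (same token id at the same position in both positively and negatively advantaged completions), the amplification factor `alpha`, and the fixed entropy cutoff `tau` are all assumptions made for illustration; the paper derives the actual entropy threshold analytically.

```python
# Minimal, illustrative sketch of a GTPO-style update (assumptions noted above;
# not the paper's exact algorithm).
import torch

def gtpo_style_loss(logps, entropies, token_ids, rewards, alpha=1.5, tau=2.0):
    """
    logps:     (G, T) per-token log-probs of the G sampled completions
    entropies: (G,)   mean policy entropy of each completion
    token_ids: (G, T) sampled token ids, padded to a common length
    rewards:   (G,)   scalar reward of each completion
    """
    # Group-relative advantages, as in GRPO: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)            # (G,)

    # Collapse prevention: drop completions whose entropy exceeds the threshold.
    keep = (entropies <= tau).float().unsqueeze(1)                       # (G, 1)

    # Broadcast the sequence-level advantage to every token of the completion.
    tok_adv = adv.unsqueeze(1).expand_as(logps).clone()                  # (G, T)

    # Conflict tokens: same token id at the same position appears in both a
    # positively and a negatively advantaged completion. Skip the negative
    # update on such tokens and amplify the positive one (illustrative rule).
    pos, neg = adv > 0, adv < 0
    for t in range(token_ids.shape[1]):
        shared = set(token_ids[pos, t].tolist()) & set(token_ids[neg, t].tolist())
        for g in range(token_ids.shape[0]):
            if token_ids[g, t].item() in shared:
                tok_adv[g, t] = 0.0 if adv[g] < 0 else alpha * adv[g]

    # Plain policy-gradient objective: no KL term, hence no reference model.
    denom = keep.expand_as(logps).sum().clamp(min=1.0)
    return -(keep * tok_adv * logps).sum() / denom
```

A usage sketch with random tensors: `gtpo_style_loss(torch.randn(4, 8), torch.rand(4), torch.randint(0, 100, (4, 8)), torch.rand(4))` returns a scalar loss suitable for `loss.backward()`. Note that, consistent with the abstract, no reference policy appears anywhere in the objective.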