Diversity or Precision? A Deep Dive into Next Token Prediction

Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
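The reading of cross-entropy as a single-step policy gradient can be checked numerically. The sketch below is an illustration only, not the paper's implementation: it assumes a softmax policy over a toy four-token vocabulary and an assumed reward of 1/pi(y) on the ground-truth token y (treated as a constant) and 0 on every other token. Under that assumption, the expected policy gradient with respect to the logits is exactly the negative of the cross-entropy gradient. The function names, the toy logits, and the reward form are all hypothetical choices made for this example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    # d/dz_k log pi(action) = 1[k == action] - pi(k) for a softmax policy.
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def cross_entropy_grad(logits, target):
    # d/dz_k of the loss -log pi(target) = pi(k) - 1[k == target].
    probs = softmax(logits)
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

def expected_policy_grad(logits, reward):
    # sum_a pi(a) * r(a) * grad_z log pi(a), with rewards held constant
    # (no gradient flows through the reward values).
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        g = grad_log_prob(probs, a)
        for k in range(len(logits)):
            grad[k] += p_a * reward[a] * g[k]
    return grad

# Toy example: four-token vocabulary, ground-truth token at index 1.
logits = [2.0, 0.5, -1.0, 0.1]
target = 1
probs = softmax(logits)

# Assumed reward: 1/pi(target) on the ground-truth token, 0 elsewhere.
reward = [1.0 / probs[target] if a == target else 0.0
          for a in range(len(logits))]

pg = expected_policy_grad(logits, reward)
ce = cross_entropy_grad(logits, target)

# Ascent on expected reward should match descent on cross-entropy,
# so the two gradients should cancel component by component.
max_gap = max(abs(g + c) for g, c in zip(pg, ce))
print("largest |policy_grad + ce_grad| component:", max_gap)
```

Because `expected_policy_grad` accepts an arbitrary reward vector, it can also be used to experiment with other reward assignments, such as nonzero rewards on non-target tokens, and to see how they change the gradient on each logit. Any such variant is an experiment of the reader's own and should not be taken as the reward-shaping scheme the abstract describes.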
