Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and for unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between composing hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to the contributions of individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of these internal policies, we find that: (a) early layers maintain high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series; and (b) Llama's prediction space rapidly converges only at the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
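As a rough illustration of the decomposition described above, the sketch below projects each layer's hidden state through the unembedding matrix to obtain an internal layer policy and measures its entropy, in the spirit of the logit lens. The checkpoint name (`Qwen/Qwen2.5-0.5B`), the `model.model.norm` attribute path, and the use of the final normalization layer before unembedding are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: per-layer "internal policies" via logit-lens-style projection.
# Checkpoint, attribute paths, and variable names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # assumed checkpoint; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("The derivative of x^2 is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, hidden]
hidden_states = out.hidden_states
unembed = model.get_output_embeddings().weight  # [vocab, hidden]
final_norm = model.model.norm  # final RMSNorm; attribute path is architecture-dependent

for layer_idx, h in enumerate(hidden_states):
    # Project the last-token hidden state of each layer into vocabulary space,
    # yielding a samplable distribution over next tokens for that layer.
    logits = final_norm(h[:, -1, :]) @ unembed.T            # [batch, vocab]
    probs = torch.softmax(logits.float(), dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)   # policy entropy
    print(f"layer {layer_idx:2d}  entropy = {entropy.item():.3f}")
```

Plotting these per-layer entropies is one way to observe the trajectory the abstract describes: broad, high-entropy distributions in early layers that sharpen toward near-zero entropy near the top.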