Training large language models via Reinforcement Learning with Verifiable Rewards (RLVR) exhibits a set of distinctive and puzzling behaviors that remain poorly understood, including a two-stage learning curve, a V-shaped response-length trajectory, and a pronounced vulnerability to catastrophic forgetting. In this work, we propose that these behaviors are emergent collective phenomena governed not by neural implementation details but by the topological evolution of the latent reasoning graph in semantic space. By demonstrating a dynamical isomorphism between a 1.5B-parameter LLM and a minimal Concept Network Model (CoNet), we trace the causal source to the self-organization of a sparse concept web pinned at an average degree of two. This geometric perspective provides a unified physical explanation for the observed anomalies: the V-shaped trajectory tracks the evolution from parallel local skill optimization to global network integration; catastrophic forgetting stems from the topological disconnection of critical ``trunk'' edges; and policy collapse arises from the accumulation of sequential transitions at the web's leaf nodes, where broad exploration abruptly freezes into rigid, high-reward trajectories. Identifying a ``maximally frustrated state'' at the transition between learning stages, we propose Annealed-RLVR, a principled algorithm that injects a targeted SFT ``heating'' step to resolve this topological bottleneck. Experiments confirm that this theory-driven intervention outperforms standard RLVR on both in-distribution and out-of-distribution benchmarks (including Minerva and AIME). By recasting RLVR from black-box optimization into a predictable process of structural self-organization, our work provides new physical intuition for engineering the emergent reasoning capabilities of future AI systems.