Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain poorly understood. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query -- associational, interventional, or counterfactual -- and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct datasets of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B--32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR's effectiveness depends on the model's initial reasoning competence. With sufficient initial competence, RLVR improves an LLM's marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These findings show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence.
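To make the setup concrete, the following is a minimal sketch, not the paper's actual pipeline, of the two ingredients the abstract describes: probabilistic queries at different levels of the causal hierarchy (associational and interventional shown here; counterfactuals omitted for brevity) on a small causal graph, and a verifiable binary reward that scores a model's numeric answer against the exact ground truth. The toy graph, all CPT values, and the `verifiable_reward` tolerance are illustrative assumptions, not taken from the paper.

```python
import itertools

# Toy confounded graph: Z -> X, Z -> Y, X -> Y, all variables binary.
# All CPT values below are illustrative assumptions, not the paper's data.
p_z = {0: 0.6, 1: 0.4}                       # P(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.7}              # P(X = 1 | Z = z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.4,   # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.5, (1, 1): 0.8}

def joint(z, x, y):
    """Observational joint P(Z=z, X=x, Y=y) from the chain-rule factorization."""
    px = p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]
    py = p_y1_given_xz[(x, z)] if y == 1 else 1.0 - p_y1_given_xz[(x, z)]
    return p_z[z] * px * py

# Level 1 (associational): P(Y=1 | X=1), by conditioning on the joint.
numerator = sum(joint(z, 1, 1) for z in (0, 1))
denominator = sum(joint(z, 1, y) for z, y in itertools.product((0, 1), (0, 1)))
p_assoc = numerator / denominator            # = 0.71 with these CPTs

# Level 2 (interventional): P(Y=1 | do(X=1)), by truncated factorization:
# delete P(X | Z) and marginalize Z from its prior (back-door adjustment).
p_interv = sum(p_z[z] * p_y1_given_xz[(1, z)] for z in (0, 1))  # = 0.62

def verifiable_reward(answer: float, truth: float, tol: float = 1e-2) -> float:
    """RLVR-style binary reward: 1 if the numeric answer matches within tol."""
    return 1.0 if abs(answer - truth) <= tol else 0.0

print(f"P(Y=1 | X=1)     = {p_assoc:.4f}")    # associational query
print(f"P(Y=1 | do(X=1)) = {p_interv:.4f}")   # differs: Z confounds X and Y
print(verifiable_reward(0.62, p_interv))      # -> 1.0
```

The gap between the two answers (0.71 vs. 0.62 under these assumed CPTs) illustrates the query-level axis, while the subgraph a query depends on (here all of {Z, X, Y}) illustrates the structural-complexity axis.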