Large language models (LLMs) demonstrate strong reasoning capabilities, particularly through long chain-of-thought (CoT) processes, which can be elicited by reinforcement learning (RL). However, prolonged CoT reasoning has notable limitations, chiefly verbose outputs caused by excessive introspection. The reasoning process in these LLMs often resembles trial and error rather than systematic, logical deduction. In contrast, tree-of-thoughts (ToT) offers a conceptually more advanced approach by modeling reasoning as exploration within a tree structure. This structure supports the parallel generation and evaluation of multiple reasoning branches, allowing unproductive paths to be actively identified, assessed, and pruned, which can potentially improve performance and reduce token cost. Building on the long CoT capability of LLMs, we introduce tree-of-thoughts RL (ToTRL), a novel on-policy RL framework with a rule-based reward. ToTRL is designed to guide LLMs in developing the parallel ToT strategy on top of the sequential CoT strategy. Furthermore, we employ LLMs as players in a puzzle game during ToTRL training. Solving puzzle games inherently requires exploring interdependent choices and managing multiple constraints, which in turn requires constructing and exploring a thought tree, providing challenging tasks for cultivating ToT reasoning capability. Our empirical evaluations demonstrate that ToTQwen3-8B, trained with ToTRL, achieves significant improvements in both performance and reasoning efficiency on complex reasoning tasks.
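To make the ToT search pattern described above concrete, the sketch below illustrates one way such a search could be organized: several candidate branches are generated in parallel at each depth, scored, and the weakest ones pruned. This is a hypothetical illustration only; the function names (`generate_thoughts`, `score_thought`), the beam-style pruning rule, and the toy scoring heuristic are our assumptions, not the paper's ToTRL implementation.

```python
# Minimal sketch of tree-of-thoughts search: expand several candidate
# branches in parallel at each depth, score them, and prune weak ones.
# generate_thoughts / score_thought are hypothetical stand-ins for LLM
# calls; the toy heuristic is illustrative only.
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                          # partial reasoning trace
    score: float = 0.0                  # heuristic value of this branch
    children: list = field(default_factory=list)

def generate_thoughts(state: str, k: int = 3) -> list[str]:
    """Stand-in for an LLM proposing k candidate next reasoning steps."""
    return [f"{state} -> step{i}" for i in range(k)]

def score_thought(state: str) -> float:
    """Stand-in for an LLM- or rule-based evaluator of a branch."""
    return -len(state)  # toy heuristic: prefer shorter traces

def tot_search(root_state: str, depth: int = 3, beam: int = 2) -> Node:
    """Level-by-level tree search with per-level pruning (beam search)."""
    root = Node(root_state)
    frontier = [root]
    for _ in range(depth):
        candidates = []
        for node in frontier:
            for s in generate_thoughts(node.state):
                child = Node(s, score_thought(s))
                node.children.append(child)
                candidates.append(child)
        # prune: keep only the `beam` highest-scoring branches
        frontier = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam]
    return max(frontier, key=lambda n: n.score)

if __name__ == "__main__":
    best = tot_search("start")
    print(best.state, best.score)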