Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: continued learning requires ever more data. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, in which a model's capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself, a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following, mathematics, and coding benchmarks show that pretrained models can be effectively improved with self-play alone.
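To make the self-play idea concrete, the following is a minimal, purely illustrative sketch of such a training loop, not the paper's implementation. The split of the single policy into a query-generating role and an answering role, the reward function, and every function name below are hypothetical placeholders introduced only for illustration.

```python
# Conceptual sketch of a language self-play loop (illustrative only).
# One policy plays both sides of a competitive game: it generates a query
# for itself, answers that query, and receives a reward for the answer,
# which drives a policy-gradient update -- no external training data.
# All functions below are hypothetical stand-ins, not a real API.

import random

def generate_query(policy):
    """Query-generating role: sample a challenge from the current policy (placeholder)."""
    return random.choice(["Summarize X.", "Solve 2 + 2.", "Write a sorting function."])

def generate_answer(policy, query):
    """Answering role: sample a response to the query from the same policy (placeholder)."""
    return f"answer to: {query}"

def reward(query, answer):
    """Score the answer, e.g. via a learned judge or verifier (placeholder)."""
    return random.random()

def policy_gradient_update(policy, query, answer, score):
    """Apply a reward-weighted RL update to the policy (placeholder)."""
    return policy  # a real update would modify the model's weights

policy = object()  # stands in for the language model being trained
for step in range(3):
    q = generate_query(policy)       # the model challenges itself
    a = generate_answer(policy, q)   # and answers its own challenge
    r = reward(q, a)                 # reward from the game outcome
    policy = policy_gradient_update(policy, q, a, r)
    print(f"step {step}: reward={r:.2f}")
```

Under this framing, both roles share one set of weights, so any improvement in answering (or in posing harder queries) feeds directly back into the same model on the next iteration.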