Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles (questioner, responder, and verifier) within a single model to support continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates the semantic equivalence between the responder's output and the questioner's reference answer, producing reward signals that guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.
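To make the self-play loop concrete, the sketch below mocks one SPELL-style iteration in Python. All names here (MockLM, self_play_step, difficulty_adapted_reward, and the role stubs) are hypothetical placeholders rather than the paper's API, and the difficulty-shaping rule is a simplified stand-in for the paper's adaptive reward; the curriculum over document length and the RL update itself are omitted.

```python
"""Schematic sketch of one SPELL-style self-play step.

A single model plays three roles via different prompts: questioner, responder,
and verifier. The MockLM class below is a runnable placeholder for that model,
not an actual LLM.
"""

import random
from statistics import mean


class MockLM:
    """Stand-in for the single shared model that plays all three roles."""

    def questioner(self, document: str) -> tuple[str, str]:
        # Questioner role: produce a question plus a reference answer from the raw document.
        return "What does the document claim?", "a placeholder reference answer"

    def responder(self, document: str, question: str) -> str:
        # Responder role: attempt to answer the question using the document as context.
        return random.choice(["a placeholder reference answer", "an unrelated answer"])

    def verifier(self, prediction: str, reference: str) -> bool:
        # Verifier role: judge semantic equivalence between prediction and reference
        # (exact match here, purely for illustration).
        return prediction == reference


def difficulty_adapted_reward(solve_rate: float) -> float:
    """Simplified shaping rule: questions the responder solves only sometimes are most
    rewarding for the questioner, since questions that are always or never solved
    carry little training signal."""
    return 1.0 - abs(solve_rate - 0.5) * 2.0


def self_play_step(model: MockLM, document: str, n_rollouts: int = 8):
    """Run one questioner -> responder -> verifier cycle and return both rewards."""
    question, reference = model.questioner(document)
    # Score each responder rollout with the verifier against the questioner's reference.
    responder_rewards = [
        1.0 if model.verifier(model.responder(document, question), reference) else 0.0
        for _ in range(n_rollouts)
    ]
    solve_rate = mean(responder_rewards)
    # Responder is rewarded per rollout; questioner is rewarded for calibrated difficulty.
    return responder_rewards, difficulty_adapted_reward(solve_rate)


if __name__ == "__main__":
    rewards, questioner_reward = self_play_step(MockLM(), "raw long document text ...")
    print("responder rewards:", rewards, "questioner reward:", questioner_reward)
```

In the actual framework both reward streams would drive reinforcement-learning updates of the same underlying model, with the document length schedule supplying progressively longer contexts.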