Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivity of these tasks precludes the verifiable rewards required by Reinforcement Learning with Verifiable Rewards (RLVR); (2) Reinforcement Learning from Human Feedback (RLHF) relies on external reward mechanisms. To overcome these limitations, we propose Self-Examining Reinforcement Learning (SERL), a novel self-improving framework in which the LLM serves as both Actor and Judge. SERL introduces two synergistic reward mechanisms that require no external signals. On the one hand, to improve the Actor's capability, we derive rewards from Copeland-style pairwise comparison judgments over a group of generated responses. On the other hand, to improve the Judge's reliability, we propose a self-consistency reward that encourages coherent judgments. This process refines the Judge's capability, which in turn provides a more robust reward for the Actor. Experiments show that our method outperforms existing self-improvement training methods. SERL improves the LC win rate of Qwen3-8B on AlpacaEval 2 from 52.37% to 59.90%. To the best of our knowledge, our method achieves state-of-the-art performance among self-improving approaches. Furthermore, it achieves performance comparable to that of significantly larger models such as Qwen3-32B, demonstrating superior effectiveness and robustness on open-domain tasks.
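As a minimal illustrative sketch (the notation here is ours, not taken from the abstract): for a prompt $x$, the Actor samples a group of $G$ responses $\{y_1, \dots, y_G\}$, and the Judge compares each pair. A Copeland-style reward for response $y_i$ could then be its number of pairwise wins minus its number of pairwise losses,
\[
r_i \;=\; \sum_{j \neq i} \Big( \mathbb{1}\big[\, y_i \succ y_j \,\big] \;-\; \mathbb{1}\big[\, y_j \succ y_i \,\big] \Big),
\]
where $y_i \succ y_j$ denotes the Judge preferring $y_i$ over $y_j$ and ties contribute zero. This is only one plausible instantiation of the "Copeland-style pairwise comparison" reward described above; the paper's exact formulation (e.g., any within-group normalization) is not specified in the abstract.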