Aligning the behavior of large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize the RM's feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization: the RM fails to provide feedback that accurately aligns with human preferences, causing the LLMs to explore unexpected generalizations and fail to achieve the alignment objectives. To mitigate this issue, we propose a novel \textit{sequence-to-sequence (seq2seq) reward modeling} method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We replace the reward modeling objective, switching from binary maximum likelihood estimation (MLE) to sequence MLE. This method enables richer and more fine-grained language feedback without additional annotations, models, or training stages. Our experiments demonstrate its effectiveness, specifically reducing the refusal-to-response paradigm in single-turn safety dialogues and the long-response bias in text summarization. Further analysis shows that seq2seq RM improves RLHF performance for 2B and 7B LLMs on three NLP tasks, achieving an average win rate of 76.9\%. We also show that seq2seq RM continues to improve RLHF performance under out-of-distribution prompts.
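As a rough illustration (a sketch under stated assumptions, not the paper's exact formulation), the switch can be written as replacing the standard Bradley-Terry binary-MLE loss over preference pairs with a token-level sequence-MLE loss; how the dispreferred response $y_l$ enters the seq2seq objective is an assumption left open here.
% Sketch only: binary MLE over a preference pair (x, y_w, y_l) with a scalar
% reward head r_\theta, versus token-level sequence MLE with the seq2seq RM's
% token distribution \pi_\theta over the preferred response y_w.
\begin{align}
  \mathcal{L}_{\mathrm{binary}}(\theta)
    &= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
       \bigl[\log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)\bigr], \\
  \mathcal{L}_{\mathrm{seq}}(\theta)
    &= -\,\mathbb{E}_{(x,\,y_w)\sim\mathcal{D}}
       \Bigl[\textstyle\sum_{t=1}^{|y_w|}
       \log \pi_\theta\bigl(y_{w,t}\mid x,\, y_{w,<t}\bigr)\Bigr].
\end{align}
Under this reading, the per-token log-likelihoods of the sequence-MLE objective are what supply the richer, fine-grained language feedback, in contrast to the single scalar produced by the binary objective.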