A generalizable reward model is crucial in Reinforcement Learning from Human Feedback (RLHF), as it enables the correct evaluation of unseen prompt-response pairs. However, existing reward models lack this ability, as they are typically trained by increasing the reward gap between chosen and rejected responses while overlooking the prompts that the responses are conditioned on. Consequently, when the trained reward model is evaluated on prompt-response pairs that lie outside the data distribution, neglecting the effect of prompts can result in poor generalization. To address this issue, we decompose the reward value into two independent components: a prompt-free reward and a prompt-related reward. The prompt-free reward represents the evaluation determined solely by the response, while the prompt-related reward reflects the evaluation that derives from both the prompt and the response. We extract these two components from an information-theoretic perspective, which requires no extra models. Subsequently, we propose a new reward learning algorithm that prioritizes data samples based on their prompt-free reward values. Through toy examples, we demonstrate that the extracted prompt-free and prompt-related rewards effectively characterize the two parts of the reward model. Furthermore, standard evaluations show that our method improves both the alignment performance and the generalization capability of the reward model.
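The abstract does not spell out the extraction procedure, so the sketch below is only a rough illustration of the idea rather than the paper's information-theoretic method: it approximates the prompt-free reward by scoring a response against an empty (zero) prompt, and then orders preference pairs by their prompt-free reward gap so that prompt-dependent pairs are handled first. All names here (`ToyRewardModel`, `prompt_free_reward`, `prioritize_pairs`) and the zero-prompt approximation are assumptions made for illustration.

```python
import torch

class ToyRewardModel(torch.nn.Module):
    """Toy reward model: scores a (prompt, response) feature pair with a small MLP."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.dim = dim
        self.net = torch.nn.Sequential(
            torch.nn.Linear(2 * dim, 16),
            torch.nn.Tanh(),
            torch.nn.Linear(16, 1),
        )

    def forward(self, prompt_feat: torch.Tensor, response_feat: torch.Tensor) -> torch.Tensor:
        # r(x, y): full reward conditioned on both prompt and response features.
        return self.net(torch.cat([prompt_feat, response_feat], dim=-1)).squeeze(-1)


def prompt_free_reward(model: ToyRewardModel, response_feat: torch.Tensor) -> torch.Tensor:
    # Assumption for illustration: approximate the prompt-free component r_pf(y)
    # by scoring the response against an uninformative (zero) prompt.
    empty_prompt = torch.zeros(response_feat.shape[0], model.dim)
    return model(empty_prompt, response_feat)


def prioritize_pairs(model, prompts, chosen, rejected):
    """Rank preference pairs for training.

    Pairs whose chosen/rejected gap is already large under the prompt-free reward
    can be separated without looking at the prompt, so (under this sketch's
    assumption) they are deprioritized in favour of prompt-dependent pairs.
    """
    full_gap = model(prompts, chosen) - model(prompts, rejected)          # r(x, y_w) - r(x, y_l)
    pf_gap = prompt_free_reward(model, chosen) - prompt_free_reward(model, rejected)
    order = torch.argsort(pf_gap)  # ascending: small prompt-free gap first
    return order, full_gap, pf_gap


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyRewardModel()
    prompts = torch.randn(4, 8)
    chosen, rejected = torch.randn(4, 8), torch.randn(4, 8)
    order, full_gap, pf_gap = prioritize_pairs(model, prompts, chosen, rejected)
    print("training order (indices):", order.tolist())
    print("prompt-related gap estimate:", (full_gap - pf_gap).tolist())
```

The last line prints a crude estimate of the prompt-related gap as the difference between the full reward gap and the prompt-free gap, mirroring the decomposition described in the abstract; the actual paper extracts both components without such a zero-prompt heuristic.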