Aligning large language models with human preferences ensures that model outputs better reflect human values, which is essential for improving interaction quality and safety. A promising strategy is Reinforcement Learning from Human Feedback (RLHF), which begins by collecting and ranking responses generated by a supervised fine-tuned (SFT) model in order to refine alignment. Existing methods such as Direct Preference Optimization (DPO) rely on pairwise comparisons, dividing responses into preferred and dispreferred pairs and optimizing the pairwise margin. However, this pairwise approach cannot capture the holistic ranking relationships among multiple responses, nor can it exploit the rich preference information available in list-wise comparisons. To address this challenge, this paper introduces \underline{D}irect \underline{R}anking \underline{P}reference \underline{O}ptimization (DRPO), a novel method that casts human preference alignment as a Learning-to-Rank (LTR) task. Unlike pairwise methods, DRPO optimizes the preference ranking of an entire list of responses by computing holistic utility scores with Normalized Discounted Cumulative Gain (NDCG), a standard LTR metric. Since NDCG is non-differentiable, we propose the diffNDCG loss, a differentiable approximation computed via a sorting network, to enable end-to-end optimization. Furthermore, we introduce a novel margin-based Adaptive Rank Policy Score to enhance the discriminative quality of generated responses. Extensive experiments show that DRPO outperforms existing methods and improves the quality of the generated responses.
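For readers less familiar with the LTR metric referenced above, a common formulation of NDCG over a ranked list of $K$ responses is
\[
\mathrm{DCG}@K = \sum_{i=1}^{K} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)},
\qquad
\mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K},
\]
where $\mathrm{rel}_i$ denotes the preference (relevance) label of the response placed at rank $i$ and $\mathrm{IDCG}@K$ is the DCG of the ideal ordering, so that $\mathrm{NDCG}@K \in [0,1]$. Because the ranks $i$ are produced by a non-differentiable sorting operation, NDCG cannot be optimized directly by gradient descent, which motivates the differentiable diffNDCG surrogate mentioned above. Note that this is only the conventional reference formulation of the metric; the exact gains and utility scores that DRPO assigns to responses are part of the proposed method.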