Holistic Utility Preference Learning for Listwise Alignment

Aligning large language models with human preferences is essential for improving interaction quality and safety by ensuring outputs better reflect human values. A promising strategy involves Reinforcement Learning from Human Feedback (RLHF), starting with collecting and ranking responses generated by a supervised fine-tuning model to refine alignment. Existing methods such as Direct Preference Optimization (DPO) focus on pairwise comparisons, categorizing responses into preferred and less preferred pairs and optimizing pairwise margins. However, this pairwise approach cannot capture the holistic ranking relationships among multiple responses or effectively leverage the rich preference information available in listwise comparisons. To address this challenge, this paper introduces Direct Ranking Preference Optimization (DRPO), a novel method that views human preference alignment as a Learning-to-Rank (LTR) task. Unlike pairwise methods, DRPO optimizes the preference ranking of entire response lists by computing holistic utility scores through NDCG, a standard LTR metric. To enable end-to-end optimization with the non-differentiable NDCG, we propose diffNDCG loss, a differentiable approximation facilitated by a sorting network. Furthermore, we introduce a novel margin-based Adaptive Rank Policy Score to enhance the discriminative quality of generated responses. Extensive experiments have shown that DRPO outperforms existing methods, enhancing the quality of the generated responses.
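To make the role of NDCG as a list-level utility concrete, the sketch below computes plain (non-differentiable) NDCG for one list of responses. It is only an illustration under stated assumptions: it uses the common textbook formulation with gain `2^rel - 1` and discount `log2(position + 1)`, which may differ from the exact variant used in the paper, and it does not implement diffNDCG or the sorting network. The function names and the toy scores and relevance labels are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order.

    Assumes the common exponential gain (2^rel - 1) and log2 discount;
    the paper's exact formulation is not given in the abstract.
    """
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)  # pos is 0-based, so rank = pos + 1
        for pos, rel in enumerate(relevances)
    )


def ndcg(relevances_in_ranked_order):
    """DCG normalized by the DCG of the ideal (descending-relevance) order."""
    ideal_dcg = dcg(sorted(relevances_in_ranked_order, reverse=True))
    if ideal_dcg == 0:
        return 0.0  # no relevant items: define NDCG as 0
    return dcg(relevances_in_ranked_order) / ideal_dcg


def ndcg_from_scores(scores, relevances):
    """Rank responses by model score (descending), then evaluate NDCG.

    scores     -- hypothetical per-response scores from the policy
    relevances -- hypothetical human preference labels (higher is better)
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ndcg([relevances[i] for i in order])


if __name__ == "__main__":
    labels = [3, 2, 1]  # toy preference labels for three responses
    print(ndcg_from_scores([0.9, 0.5, 0.1], labels))  # scores agree with labels
    print(ndcg_from_scores([0.1, 0.5, 0.9], labels))  # scores reverse the labels
```

The sorting step in `ndcg_from_scores` is what makes the metric non-differentiable with respect to the scores, which is the gap the abstract says diffNDCG is meant to fill. A score ordering that matches the labels yields an NDCG of 1, and any other ordering of distinct labels yields a lower value.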

