Multi-Objective Alignment (MOA) aims to align LLM responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering optimization toward the Pareto front. To address this, we propose constructing Pareto-optimal responses to resolve preference conflicts. To obtain and exploit such responses efficiently, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate that our framework achieves a superior Pareto front compared to various baselines. Code is available at https://github.com/zyttt-coder/SIPO.
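To make the notion of "Pareto-optimal responses" concrete, the sketch below filters a pool of candidate responses down to those whose per-objective reward scores are not Pareto-dominated by any other candidate. This is only an illustrative assumption of how such a selection step could look, not the released SIPO implementation; the function names (`dominates`, `pareto_front`) and the example objectives are hypothetical.

```python
# Minimal sketch (not the SIPO implementation): keep only candidate
# responses whose multi-objective reward vectors are Pareto-optimal.
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if score vector `a` Pareto-dominates `b`: at least as good on
    every objective and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates: List[str], scores: List[Sequence[float]]) -> List[str]:
    """Return the responses whose score vectors are not dominated by any other."""
    front = []
    for i, resp in enumerate(candidates):
        if not any(dominates(scores[j], scores[i])
                   for j in range(len(candidates)) if j != i):
            front.append(resp)
    return front

# Example with two objectives (e.g., helpfulness, harmlessness):
# response "b" dominates "a", so only "b" and "c" lie on the Pareto front.
responses = ["a", "b", "c"]
reward_scores = [(0.2, 0.9), (0.4, 0.9), (0.8, 0.1)]
print(pareto_front(responses, reward_scores))  # ['b', 'c']
```

A selection step of this kind avoids the preference conflicts described above: a Pareto-optimal response is, by construction, never strictly worse than a rejected response under any single objective, so the preferred/rejected pairing does not pull the optimization in contradictory directions.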