The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient alignment method that directly optimizes models on preference pairs, significantly reducing resource demands. However, the effectiveness of DPO depends heavily on data quality, which is frequently compromised by noise. In this work, we propose $\gamma$-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, $\gamma$-PO strategically prioritizes high-confidence pairs (those exhibiting larger reward margins) while suppressing the potential noise from ambiguous pairs. Moreover, $\gamma$-PO is a plug-and-play method compatible with DPO variants that rely on the reward margin between preference pairs. On benchmarks such as AlpacaEval 2 and Arena-Hard, $\gamma$-PO achieves an average improvement of 4.4\% over other baselines, setting a new state of the art. Additionally, $\gamma$-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at \href{https://github.com/sunjie279/gammaPO}{https://github.com/sunjie279/gammaPO}.
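For concreteness, one way to write such a pairwise-margin objective is to extend a DPO-style loss with an instance-specific target margin; the form below is a sketch of the general idea, not necessarily the exact formulation used by $\gamma$-PO:
\[
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\big(\beta\,\Delta_\theta(x, y_w, y_l) \;-\; \gamma(x, y_w, y_l)\big)\Big],
\]
where $\Delta_\theta$ denotes the implicit reward margin of the underlying DPO variant (e.g., $\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}$ for standard DPO) and $\gamma(\cdot)$ is a target margin calibrated per preference pair rather than held fixed, so that high-confidence pairs can be emphasized while ambiguous, potentially noisy pairs are down-weighted.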