Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, its effectiveness depends critically on ranking accuracy, a metric where further gains remain highly impactful. This limitation arises from a fundamental problem that we identify and formalize as the Overfitting-Underfitting Dilemma: current margin designs cause models to apply excessive, wasteful gradients to correctly ranked samples (overfitting) while providing insufficient corrective signals for misranked ones (underfitting). To resolve this dilemma, we propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones. Extensive experiments on widely used benchmarks demonstrate that AMaPO achieves not only better ranking accuracy but also superior downstream alignment performance, and targeted analysis confirms that it successfully mitigates the core overfitting and underfitting issues.
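To make the mechanism concrete, the following is a minimal, hypothetical sketch of how an instance-wise adaptive margin built from Z-normalization and exponential scaling might be attached to a DPO-style pairwise loss. It is not the paper's exact formulation: the function name `amapo_loss`, the hyperparameters `beta` and `tau`, and the choice to detach the margin are all illustrative assumptions; the precise AMaPO objective is defined in the paper itself.

```python
import torch
import torch.nn.functional as F

def amapo_loss(chosen_logratios: torch.Tensor,
               rejected_logratios: torch.Tensor,
               beta: float = 0.1,
               tau: float = 1.0) -> torch.Tensor:
    """Illustrative adaptive-margin preference loss (assumed form, not the official AMaPO)."""
    # Implicit per-sample reward margin: positive when the pair is ranked correctly,
    # negative when the model misranks the rejected response above the chosen one.
    margins = chosen_logratios - rejected_logratios  # shape: (batch,)

    # Z-normalize margins within the batch so the adaptive term is scale-invariant.
    z = (margins - margins.mean()) / (margins.std(unbiased=False) + 1e-8)

    # Exponential scaling: misranked samples (low z) receive a large margin,
    # correctly ranked samples (high z) receive a margin near zero.
    adaptive_margin = torch.exp(-z / tau)

    # Attach the margin inside a DPO-style logistic loss. Detaching treats it as a
    # per-sample gradient modulator: larger margins sharpen gradients on misranked
    # pairs, while near-zero margins leave correct pairs in the saturated (low-gradient)
    # region of the log-sigmoid, which is the reallocation behavior described above.
    loss = -F.logsigmoid(beta * margins - adaptive_margin.detach())
    return loss.mean()
```

A typical usage would compute `chosen_logratios` and `rejected_logratios` as the policy-to-reference log-probability ratios (or plain policy log-probabilities in a reference-free variant) for each preference pair in a batch, then backpropagate through `amapo_loss`.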