Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical-overlap or position preferences). However, prior paradigms typically address these biases in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary cause: latent spurious feature correlations within the input that drive erroneous reasoning shortcuts. Motivated by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework that tackles both failure modes by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference-update mechanism that dynamically evaluates logit-level contributions and suppresses shortcut features. Extensive experiments across benchmarks covering stereotypical bias (BBQ, UnQover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.
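The abstract does not specify the objective, but the mechanism it describes (a counterfactual contrast that isolates bias-inducing features, plus a preference update that weighs logit-level contributions) can be illustrated with a minimal, hypothetical sketch. Assuming C2PO resembles a DPO-style preference objective augmented with a counterfactual margin term, one possible form is below; `c2po_loss`, its arguments, and the hyperparameters `beta` and `lam` are illustrative names, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def c2po_loss(logp_w, logp_l, logp_w_cf, logp_l_cf, beta=0.1, lam=0.5):
    """Hypothetical counterfactual-contrastive preference loss.

    logp_w / logp_l: summed log-probs of the preferred / dispreferred
        responses under the policy (reference-adjusted, as in DPO).
    logp_w_cf / logp_l_cf: the same quantities on a counterfactual
        prompt in which the suspected bias-inducing feature (e.g., a
        demographic term or an overlapping lexical cue) is ablated.
    """
    # Preference margin on the factual prompt (DPO-style term).
    margin = beta * (logp_w - logp_l)
    # Margin on the counterfactual prompt: the preference that survives
    # removal of the spurious feature, i.e., the "valid" component.
    margin_cf = beta * (logp_w_cf - logp_l_cf)
    # The gap between the two margins estimates the logit-level
    # contribution of the shortcut feature; penalizing it discourages
    # the policy from relying on the spurious correlation.
    shortcut = margin - margin_cf
    return (-F.logsigmoid(margin) + lam * shortcut.pow(2)).mean()

# Toy usage with a batch of 4 (random stand-ins for log-prob terms):
if __name__ == "__main__":
    b = 4
    loss = c2po_loss(torch.randn(b), torch.randn(b),
                     torch.randn(b), torch.randn(b))
    print(loss.item())
```

In practice the counterfactual prompts would come from the causal intervention step the abstract alludes to; here they are placeholders supplied by the caller.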