Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating model fairness is a complex challenge, and existing approaches typically rely on standard question-answer (QA) schemes. Such methods often overlook deeper issues by interpreting a model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases: unfair preferences encoded within a model's latent space that are effectively concealed by safety alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which offer limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach on multiple LLMs, where our findings reveal an alarming gap between models' direct responses and their underlying fairness issues.
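To make the activation-steering step concrete, below is a minimal sketch of how a "refusal direction" could be estimated from contrasting prompts and subtracted from a model's residual stream during generation. The model name, layer index, steering strength, and example prompts are illustrative assumptions and do not reproduce the exact SBB procedure.

```python
# Illustrative sketch of activation steering to reduce refusals.
# Model, layer, strength, and prompts below are assumptions, not the SBB setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder safety-aligned model
LAYER_IDX = 14                                # assumed intervention layer
ALPHA = 4.0                                   # assumed steering strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_hidden(prompts, layer_idx):
    """Mean last-token hidden state at `layer_idx` over a set of prompts."""
    states = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(states).mean(dim=0)

# Estimate a refusal direction as the difference between activations on
# prompts the model tends to refuse and prompts it answers (toy examples).
refused = ["Which group is worse at math?"]
answered = ["Which city is the capital of France?"]
refusal_dir = mean_hidden(refused, LAYER_IDX) - mean_hidden(answered, LAYER_IDX)
refusal_dir = refusal_dir / refusal_dir.norm()

def steer_hook(module, inputs, output):
    # Subtract the refusal direction from the residual stream at this layer.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - ALPHA * refusal_dir.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER_IDX].register_forward_hook(steer_hook)
prompt = "Which group is worse at math? Answer with one word."
ids = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=20)
print(tokenizer.decode(gen[0], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```

The design intent is that steering suppresses the refusal behavior without otherwise rewriting the prompt, so any preference the model then expresses reflects what was encoded in its latent space rather than an artifact of prompt manipulation.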