As Large Language Models (LLMs) have risen in prominence over the past few years, concern has grown over the biases LLMs may inherit from their training data. Previous studies have examined how LLMs exhibit implicit bias, such as when response generation changes after different social contexts are introduced. We argue that this implicit bias is not only an ethical issue but also a technical one, as it reveals an inability of LLMs to accommodate extraneous information. However, unlike other measures of LLM intelligence, there is no standard method to benchmark this specific subset of LLM bias. To bridge this gap, we developed a method for calculating an easily interpretable benchmark, DIF (Demographic Implicit Fairness), by evaluating preexisting LLM logic and math problem datasets with sociodemographic personas, combined with a statistical robustness check using a null model. We demonstrate that this method can validate the presence of implicit bias in LLM behavior and find a novel inverse trend between question-answering accuracy and implicit bias, supporting our argument.
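To make the described workflow concrete, the sketch below illustrates one way a persona-conditioned fairness benchmark with a permutation-style null model could be computed. The abstract does not give the exact DIF formula, so the max-min accuracy gap used here, the `QueryFn` interface, and the function names are illustrative assumptions rather than the paper's actual formulation.

```python
import random
from typing import Callable, Dict, List

# Hypothetical model interface (assumption): returns True if the model answers
# the question correctly when the prompt is prefixed with a persona statement.
QueryFn = Callable[[str, str], bool]  # (persona_prefix, question) -> correct?

def accuracy_by_persona(
    query: QueryFn,
    questions: List[str],
    personas: Dict[str, str],  # persona name -> prompt prefix, e.g. {"baseline": ""}
) -> Dict[str, float]:
    """Per-persona accuracy on a fixed logic/math question set."""
    return {
        name: sum(query(prefix, q) for q in questions) / len(questions)
        for name, prefix in personas.items()
    }

def disparity(acc: Dict[str, float]) -> float:
    """Placeholder disparity measure: max-min accuracy gap across personas.
    The paper's actual DIF score may be defined differently."""
    vals = list(acc.values())
    return max(vals) - min(vals)

def null_model_p_value(
    per_item_correct: Dict[str, List[bool]],  # persona -> per-question correctness
    n_permutations: int = 1000,
    seed: int = 0,
) -> float:
    """Permutation-style null model: shuffle persona labels over the pooled
    per-question outcomes and measure how often a gap at least as large as
    the observed one arises by chance."""
    rng = random.Random(seed)
    personas = list(per_item_correct)
    n = len(next(iter(per_item_correct.values())))
    observed = disparity({p: sum(v) / n for p, v in per_item_correct.items()})
    pooled = [c for v in per_item_correct.values() for c in v]
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        fake = {p: pooled[i * n:(i + 1) * n] for i, p in enumerate(personas)}
        if disparity({p: sum(v) / n for p, v in fake.items()}) >= observed:
            exceed += 1
    return exceed / n_permutations
```

Under this sketch, a small disparity that the null model cannot distinguish from chance would not count as evidence of implicit bias, while a large, statistically robust gap would.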