Textual data used to train large language models (LLMs) exhibits bias in multiple forms, including harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two types of data bias, namely representation bias and (explicit) stereotypes, for a configurable sensitive attribute. First, we detect relevant group labels using LLM-generated word lists constructed according to quality criteria. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate for representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using gender, religion, and age as example attributes. First, the effectiveness of each individual component for data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters) fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
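To make the second component concrete, the sketch below shows one way representation bias can be quantified: count group-label mentions per demographic group and measure how far the observed distribution deviates from a balanced reference. The function name `demographic_representation_score`, the hard-coded word lists, the uniform reference distribution, and the use of total variation distance are illustrative assumptions, not the paper's exact formulation of the Demographic Representation Score.

```python
# Minimal sketch (assumed, not the paper's exact implementation) of quantifying
# representation bias: count mentions of group labels per demographic group and
# compare the observed shares to a uniform reference distribution.
import re
from collections import Counter

# Hypothetical word lists; in the pipeline these would come from the
# LLM-generated, quality-checked word lists (component 1).
GROUP_LABELS = {
    "female": ["she", "her", "woman", "women", "girl"],
    "male": ["he", "him", "man", "men", "boy"],
}

def demographic_representation_score(texts):
    """Return per-group mention shares and their total variation distance
    from a uniform reference (0 = balanced, higher = more skewed)."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for group, labels in GROUP_LABELS.items():
            counts[group] += sum(tokens.count(label) for label in labels)
    total = sum(counts.values()) or 1
    observed = {group: counts[group] / total for group in GROUP_LABELS}
    uniform = 1 / len(GROUP_LABELS)
    skew = 0.5 * sum(abs(p - uniform) for p in observed.values())
    return observed, skew

# Example: a small corpus that mentions male group labels more often.
corpus = ["He said the man and the boy left.", "She waved as he drove away."]
shares, skew = demographic_representation_score(corpus)
print(shares, skew)  # e.g. {'female': 0.2, 'male': 0.8} 0.3
```

Counterfactual data augmentation (the fourth component) would then add rewritten copies of over- or under-represented passages with group labels swapped, pushing these shares back toward the reference distribution.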