Large Language Models Develop Novel Social Biases Through Adaptive Exploration

As large language models (LLMs) are adopted into frameworks that grant them the capacity to make real decisions, it is increasingly important to ensure that they are unbiased. In this paper, we argue that the predominant approach of simply removing existing biases from models is not enough. Using a paradigm from the psychology literature, we demonstrate that LLMs can spontaneously develop novel social biases about artificial demographic groups even when no inherent differences exist. These biases result in highly stratified task allocations, which are less fair than assignments by human participants and are exacerbated by newer and larger models. In social science, emergent biases like these have been shown to result from exploration-exploitation trade-offs, where the decision-maker explores too little, allowing early observations to strongly influence impressions about entire demographic groups. To alleviate this effect, we examine a series of interventions targeting model inputs, problem structure, and explicit steering. We find that explicitly incentivizing exploration most robustly reduces stratification, highlighting the need for better multifaceted objectives to mitigate bias. These results reveal that LLMs are not merely passive mirrors of human social biases, but can actively create new ones from experience, raising urgent questions about how these systems will shape societies over time.
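
The exploration-exploitation mechanism described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paradigm or the LLM experiments used in the paper: a simple allocator assigns tasks to groups whose true success rates are identical, and an `epsilon` parameter controls how often it explores at random instead of picking the group with the best estimate so far. The names `simulate`, `top_share`, `epsilon`, and `p_success` are hypothetical and chosen only for illustration.

```python
import random


def simulate(n_rounds, epsilon, n_groups=2, p_success=0.8, seed=0):
    """Assign n_rounds tasks among groups with IDENTICAL true success rates.

    Hypothetical toy model: with probability epsilon the allocator explores
    (picks a group uniformly at random); otherwise it exploits the group with
    the highest estimated success rate. Returns the task count per group.
    """
    rng = random.Random(seed)
    successes = [0] * n_groups
    counts = [0] * n_groups
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            group = rng.randrange(n_groups)  # explore
        else:
            # Exploit: smoothed estimate (successes + 1) / (trials + 2),
            # so an untried group starts at 0.5. Ties are broken at random.
            estimates = [(successes[i] + 1) / (counts[i] + 2)
                         for i in range(n_groups)]
            best = max(estimates)
            group = rng.choice(
                [i for i, e in enumerate(estimates) if e == best])
        counts[group] += 1
        # Every group succeeds with the same probability, so any
        # stratification comes from the allocator, not from the groups.
        if rng.random() < p_success:
            successes[group] += 1
    return counts


def top_share(counts):
    """Fraction of all tasks given to the most-assigned group."""
    return max(counts) / sum(counts)


if __name__ == "__main__":
    for eps in (0.0, 0.3):
        shares = [top_share(simulate(200, eps, seed=s)) for s in range(20)]
        print(f"epsilon={eps}: mean top share = {sum(shares) / len(shares):.2f}")
```

In this sketch, a purely greedy allocator (`epsilon=0.0`) often locks onto whichever group happens to succeed early and rarely revisits the other, even though the groups are identical. Raising `epsilon` forces continued sampling of every group, which is one simple way to picture why incentivizing exploration could reduce stratification.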
