Fundamental Limitations of Alignment in Large Language Models

An important aspect of developing language models that interact with humans is aligning their behavior to be useful and harmless for their human users. This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones, a process referred to as alignment. In this paper, we propose a theoretical approach called Behavior Expectation Bounds (BEB), which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models (LLMs). Importantly, we prove that for any behavior that has a finite (that is, nonzero) probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with a probability that increases with the length of the prompt. This implies that any alignment process that attenuates undesired behavior but does not remove it altogether is not safe against adversarial prompting attacks. Furthermore, our framework hints at the mechanism by which leading alignment approaches, such as reinforcement learning from human feedback, increase the LLM's proneness to being prompted into undesired behaviors. Moreover, we include the notion of personas in our BEB framework, and find that behaviors that are generally very unlikely to be exhibited by the model can be brought to the forefront by prompting the model to behave as a specific persona. This theoretical result is demonstrated experimentally at large scale by the so-called contemporary "ChatGPT jailbreaks", in which adversarial users trick the LLM into breaking its alignment guardrails by triggering it into acting as a malicious persona. Our results expose fundamental limitations in the alignment of LLMs and bring to the forefront the need to devise reliable mechanisms for ensuring AI safety.

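The central claim of the abstract, that an attenuated but not removed behavior can be brought out by a sufficiently long prompt, can be illustrated with a toy calculation. The sketch below is not the BEB formalism or the paper's proof. It assumes, purely for illustration, that the model is a mixture of two components (an "aligned" one and a "misaligned" one), that the prompt consists of identical "trigger" tokens that the misaligned component finds more likely, and that behavior is scored +1 for aligned output and -1 for misaligned output. All function names and numbers are hypothetical.

```python
def posterior_misaligned_weight(prior, p_trigger_bad, p_trigger_good, prompt_len):
    """Bayesian posterior weight of the misaligned component after the model
    is conditioned on `prompt_len` trigger tokens (toy two-component mixture)."""
    bad = prior * p_trigger_bad ** prompt_len
    good = (1.0 - prior) * p_trigger_good ** prompt_len
    return bad / (bad + good)


def expected_behavior(weight_bad, score_bad=-1.0, score_good=1.0):
    """Expected behavior score of the mixture: +1 is fully aligned, -1 fully misaligned."""
    return weight_bad * score_bad + (1.0 - weight_bad) * score_good


if __name__ == "__main__":
    # Hypothetical numbers: alignment has reduced the misaligned component
    # to 1% of the mixture, but has not removed it.
    prior = 0.01
    for n in range(0, 9, 2):
        w = posterior_misaligned_weight(prior, 0.8, 0.2, n)
        print(f"prompt length {n}: misaligned weight {w:.4f}, "
              f"expected behavior {expected_behavior(w):+.4f}")
```

Under these assumptions the misaligned weight grows monotonically with prompt length whenever the prior is nonzero, and stays at exactly zero when the prior is zero, which mirrors the abstract's distinction between attenuating a behavior and removing it altogether.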