Large language models (LLMs) exhibit distinct and consistent personalities that strongly influence user trust and engagement. Personality frameworks would therefore be valuable tools for characterizing and controlling LLM behavior, yet current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions in activation space has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used to probe and steer model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters together with their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space and test their effectiveness for probing and steering model behavior. Our results suggest that trait-aligned linear directions are effective probes for personality detection, whereas their steering capabilities depend strongly on context: they produce reliable effects in forced-choice tasks but have limited influence in open-ended generation or when additional context is present in the prompt.
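The probing-and-steering recipe described above can be sketched in a few lines. This is a toy illustration, not the paper's code: the activations are synthetic stand-ins, and the array shapes, trait range, and `steer` helper are assumptions made for the example. The core idea is faithful to the abstract, though: fit a linear regression from hidden activations to trait scores (the weight vector is the trait-aligned direction, and it doubles as a probe), then steer by adding a scaled copy of that direction to an activation.

```python
# Hedged sketch of trait-aligned linear probing and steering.
# All data here is synthetic; shapes and the trait range [1, 5]
# are illustrative assumptions, not the paper's actual setup.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy stand-in for one layer's hidden activations:
# n character prompts x d hidden dimensions, each with a
# scalar trait score (e.g., extraversion) in [1, 5].
n, d = 406, 64
trait_scores = rng.uniform(1.0, 5.0, size=n)
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
# Activations = noise + (trait score) * (hidden "trait direction").
activations = rng.normal(size=(n, d)) + np.outer(trait_scores, true_dir)

# Probing: regress trait scores on activations. The normalized
# weight vector is the learned per-layer direction; R^2 measures
# how well the probe detects the trait.
probe = LinearRegression().fit(activations, trait_scores)
direction = probe.coef_ / np.linalg.norm(probe.coef_)
r2 = probe.score(activations, trait_scores)

# Steering: shift a hidden state along the learned direction.
def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled copy of the trait direction to an activation."""
    return hidden + alpha * direction

steered = steer(activations[0], direction, alpha=4.0)
# Under the probe, the steered activation scores higher on the trait.
assert probe.predict(steered[None])[0] > probe.predict(activations[:1])[0]
```

In the full method this would be repeated per layer, and steering would be applied to the model's residual stream during generation; the abstract's finding is that the probe side works reliably while the steering side is context-dependent.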