AI systems that output their reasoning in natural language offer an opportunity for safety -- we can \emph{monitor} their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different \emph{training incentives}, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable using the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.
翻译:以自然语言输出推理过程的AI系统为安全性提供了机遇——我们可以对其思维链(CoT)进行监控,以识别不良推理(例如追求有害目标)。然而,思维链在多大程度上忠实反映底层推理过程,进而其可监控性的有效程度,可能受到训练中特定因素的影响。本研究探讨了应用于推理模型的不同训练激励如何影响其可监控性。我们提出了一种新颖的方法论,通过监控器能否利用模型推理预测关键潜变量来衡量可监控性。在控制准确率的情况下,未发现常用激励(长度惩罚和KL正则化)产生一致影响的证据,但发现对抗性优化(惩罚监控器准确率)会降低监控性能,而直接针对可监控性进行优化并未可靠地带来改进。代码发布于 https://github.com/QiyaoWei/reasoning-under-pressure。