When prompted to think step-by-step, language models (LMs) produce a chain of thought (CoT), a sequence of reasoning steps that the model supposedly used to produce its prediction. Despite much work on CoT prompting, it is unclear whether the reasoning verbalized in a CoT is faithful to the model's parametric beliefs. We introduce a framework for measuring the parametric faithfulness of generated reasoning, and propose Faithfulness by Unlearning Reasoning steps (FUR), an instance of this framework. FUR erases the information contained in reasoning steps from the model's parameters and measures faithfulness as the resulting effect on the model's prediction. Our experiments with four LMs and five multi-hop multi-choice question answering (MCQA) datasets show that FUR is frequently able to precisely change the underlying model's prediction for a given instance by unlearning key steps, indicating when a CoT is parametrically faithful. Further analysis shows that CoTs generated by models after unlearning support different answers, hinting at a deeper effect of unlearning.
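The sketch below illustrates the FUR idea described above; it is not the authors' implementation. It assumes unlearning can be approximated by a few gradient-ascent steps on the tokens of one key reasoning step, and that the MCQA prediction is read off as the answer option with the highest next-token log-probability. The model name, question, options, and hyperparameters are illustrative placeholders.

```python
# Hedged sketch of FUR: unlearn one CoT step, then check whether the
# model's multi-choice prediction flips. Not the paper's exact method.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def predict_option(model, question: str, options: list[str]) -> str:
    """Return the option letter (A, B, ...) with the highest log-probability."""
    prompt = question + "\n" + "\n".join(options) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    letters = [opt.split(".")[0].strip() for opt in options]  # "A" from "A. ..."
    letter_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[0] for l in letters]
    best = max(letter_ids, key=lambda i: logits[i].item())
    return tokenizer.decode(best).strip()


def unlearn_step(model, step_text: str, lr: float = 2e-5, n_steps: int = 5):
    """Erase one reasoning step via gradient ascent on its token likelihood
    (a simple stand-in for whatever unlearning method FUR actually uses)."""
    unlearned = copy.deepcopy(model)
    unlearned.train()
    optim = torch.optim.AdamW(unlearned.parameters(), lr=lr)
    batch = tokenizer(step_text, return_tensors="pt")
    for _ in range(n_steps):
        loss = unlearned(**batch, labels=batch["input_ids"]).loss
        (-loss).backward()  # ascend: make the step less likely under the model
        optim.step()
        optim.zero_grad()
    unlearned.eval()
    return unlearned


# Faithfulness signal: does erasing the key step change the prediction?
question = "Which city is the capital of the country where the Eiffel Tower stands?"
options = ["A. Paris", "B. Berlin", "C. Madrid", "D. Rome"]
key_step = "The Eiffel Tower is located in France."  # one CoT step to unlearn

before = predict_option(model, question, options)
after = predict_option(unlearn_step(model, key_step), question, options)
print(f"before unlearning: {before}, after unlearning: {after}")
print("step looks parametrically faithful" if before != after else "no flip observed")
```

In this reading, a prediction flip after erasing a step is evidence that the step reflects knowledge the model actually relied on; a careful evaluation would additionally check that unrelated knowledge survives the unlearning, which this sketch omits.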