Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.
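The faithful@k metric described above can be sketched as a pass@k-style estimator. The paper's exact definition is not given here, so this is a hypothetical formulation under one natural assumption: given n sampled CoTs of which c verbalize the hint, faithful@k estimates the probability that at least one of k samples would verbalize it.

```python
from math import comb

def faithful_at_k(n: int, c: int, k: int) -> float:
    """Hypothetical pass@k-style estimator (an assumption, not the
    paper's confirmed definition): probability that at least one of
    k sampled CoTs verbalizes the hint, given that c of n samples did.

    Uses the unbiased combinatorial form 1 - C(n-c, k) / C(n, k).
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one verbalizing CoT.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this reading, raising the inference-time budget increases c (more samples verbalize the hint), which drives faithful@k toward 1 even for modest k.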