In a striking neuroscience study, the authors placed a dead salmon in an MRI scanner and showed it images of humans in social situations. Astonishingly, standard analyses of the time reported brain regions predictive of social emotions. The explanation, of course, was not supernatural cognition but a cautionary tale about misapplied statistical inference. In AI interpretability, reports of similar "dead salmon" artifacts abound: feature attribution, probing, sparse autoencoding, and even causal analyses can produce plausible-looking explanations for randomly initialized neural networks. In this work, we examine this phenomenon and argue for a pragmatic statistical-causal reframing: explanations of computational systems should be treated as parameters of a (statistical) model, inferred from computational traces. This perspective goes beyond simply measuring the statistical variability of explanations due to finite sampling of input data: interpretability methods become statistical estimators, and findings should be tested against explicit and meaningful alternative computational hypotheses, with uncertainty quantified with respect to the postulated statistical model. It also highlights important theoretical issues, such as the identifiability of common interpretability queries, which we argue is critical to understanding the field's susceptibility to false discoveries, poor generalizability, and high variance. More broadly, situating interpretability within the standard toolkit of statistical inference opens promising avenues for future work aimed at turning AI interpretability into a pragmatic and rigorous science.
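To make the proposed reframing concrete, the following minimal sketch (not the paper's method; all models, scores, and numbers are hypothetical and illustrative) treats a toy attribution score as a statistical estimator and tests it against an explicit null hypothesis: that the same interpretability method applied to randomly initialized networks yields scores just as large. This is the computational analogue of the dead-salmon control.

```python
import numpy as np

rng = np.random.default_rng(0)

def attribution_score(weights, x):
    """Toy 'interpretability method': mean absolute contribution of input
    feature 0 in a linear model. Stands in for any estimator that maps a
    network plus inputs (computational traces) to an explanation."""
    return float(np.abs(weights[0] * x[:, 0]).mean())

# Hypothetical 'trained' model in which feature 0 genuinely matters,
# and a sample of inputs serving as the computational traces.
trained_w = np.array([2.0, 0.1, 0.1])
X = rng.normal(size=(500, 3))

observed = attribution_score(trained_w, X)

# Null hypothesis: randomly initialized networks (the 'dead salmon'
# baseline) produce attribution scores at least this extreme.
null_scores = np.array([
    attribution_score(rng.normal(size=3), X) for _ in range(1000)
])

# Monte Carlo p-value with add-one smoothing.
p_value = (1 + np.sum(null_scores >= observed)) / (1 + len(null_scores))
print(f"observed score: {observed:.3f}, null p-value: {p_value:.4f}")
```

A small observed p-value here means the explanation is distinguishable from what the method reports on random networks; without such an explicit alternative hypothesis, a large attribution score alone carries no evidential weight.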