Do Natural Language Descriptions of Model Activations Convey Privileged Information?

Recent interpretability methods have proposed translating LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they can succeed at benchmarks without any access to target model internals, suggesting that these datasets may not be ideal for evaluating verbalization methods. We then run controlled experiments that reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM that generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.
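The input-only control described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names (`accuracy`, `privileged_gain`), the exact-match scoring, and the idea of summarizing the comparison as a single accuracy difference are all assumptions made for illustration. The two predictors stand in for a verbalizer that reads target-model activations and a baseline that sees only the input text.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types: a predictor maps an input text to an answer string,
# and an example is an (input_text, expected_answer) pair.
Predictor = Callable[[str], str]
Example = Tuple[str, str]


def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples answered correctly under case-insensitive exact match."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(
        predict(text).strip().lower() == answer.strip().lower()
        for text, answer in examples
    )
    return correct / len(examples)


def privileged_gain(
    with_activations: Predictor,
    input_only: Predictor,
    examples: Sequence[Example],
) -> float:
    """Accuracy of the activation-based verbalizer minus the input-only baseline.

    A gain near zero suggests the benchmark can be solved from the input alone,
    so it cannot show that the verbalizer conveys privileged information.
    """
    return accuracy(with_activations, examples) - accuracy(input_only, examples)
```

Under these assumptions, a benchmark on which `privileged_gain` stays close to zero is one where success does not require access to target model internals, which is the situation the abstract warns about.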

