When an LLM learns a new fact during finetuning (e.g., a new movie release or a newly elected pope), where does this information go? Are entities enriched with relation information, or do models recall information just-in-time before a prediction? Or is it ``all of the above,'' with LLMs implementing multiple redundant heuristics? Existing localization approaches (e.g., activation patching) are ill-suited for this analysis because they usually \textit{replace} parts of the residual stream, thus overriding previous information. To fill this gap, we propose \emph{dynamic weight grafting}, a technique that selectively grafts weights from a finetuned model onto a pretrained model. Using this technique, we show two separate pathways for retrieving finetuned relation information: 1) ``enriching'' the residual stream with relation information while processing the tokens that correspond to an entity (e.g., ``Zendaya'' in ``Zendaya co-starred with John David Washington'') and 2) ``recalling'' this information at the final token position before generating a target fact. In some cases, models need information from both pathways to correctly generate finetuned facts, while in other cases either the ``enrichment'' or the ``recall'' pathway alone is sufficient. We localize the ``recall'' pathway to model components, finding that ``recall'' occurs via both task-specific attention mechanisms and an entity-specific extraction step in the feedforward networks of the final layers before the target prediction. By targeting model components and parameters, rather than just activations, we are able to understand the \textit{mechanisms} by which finetuned knowledge is retrieved during generation.
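To make the grafting operation concrete, the following is a minimal sketch of one possible position-dependent grafting loop in PyTorch, assuming two architecture-compatible causal LMs with HuggingFace-style key-value caching; the name \texttt{dynamic\_grafted\_logits} and the arguments \texttt{graft\_positions} and \texttt{component\_names} are illustrative placeholders, not the exact implementation used here.

\begin{verbatim}
import copy
import torch

@torch.no_grad()
def dynamic_grafted_logits(pretrained, finetuned, input_ids,
                           graft_positions, component_names):
    # Build the grafted model once: the named components take their
    # weights from the finetuned checkpoint; everything else stays
    # pretrained. Both models must share the same architecture.
    grafted = copy.deepcopy(pretrained)
    ft_state = finetuned.state_dict()
    state = grafted.state_dict()
    for name in component_names:
        state[name] = ft_state[name].clone()
    grafted.load_state_dict(state)

    # Process one position at a time, routing each position through
    # either the grafted or the pretrained weights. The shared KV
    # cache carries whatever the grafted weights wrote at earlier
    # positions (e.g., entity tokens) into later computation.
    past, logits = None, None
    for t in range(input_ids.shape[1]):
        model = grafted if t in graft_positions else pretrained
        out = model(input_ids[:, t:t + 1],
                    past_key_values=past, use_cache=True)
        past, logits = out.past_key_values, out.logits
    return logits  # next-token logits after the final position
\end{verbatim}

In this sketch, grafting at the entity token positions corresponds to testing the ``enrichment'' pathway, while grafting at the final token position corresponds to testing the ``recall'' pathway.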