Sparse autoencoders (SAEs) aim to disentangle model activations into monosemantic, human-interpretable features. In practice, learned features are often redundant and vary across training runs and sparsity levels, which makes interpretations difficult to transfer and reuse. We introduce Distilled Matryoshka Sparse Autoencoders (DMSAEs), a training pipeline that distills a compact core of consistently useful features and reuses it to train new SAEs. DMSAEs run an iterative distillation cycle: train a Matryoshka SAE with a shared core, use gradient × activation to measure each feature's contribution to the next-token loss under the most nested reconstruction, and keep only the smallest subset of features that explains a fixed fraction of the total attribution. Only the core encoder weight vectors are transferred across cycles; the core decoder and all non-core latents are reinitialized each time. On Gemma-2-2B layer 12 residual stream activations, seven cycles of distillation (500M tokens, 65k width) yielded a distilled core of 197 repeatedly selected features. Training with this distilled core improves several SAEBench metrics and demonstrates that consistent sets of latent features can be transferred across sparsity levels.
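To make the attribution-and-selection step concrete, the following is a minimal sketch of scoring each core latent by gradient × activation against a downstream loss and keeping the smallest subset that covers a fixed fraction of the total attribution. The toy SAE, the surrogate loss, and the 0.9 coverage fraction are illustrative assumptions, not the paper's actual Gemma-2-2B / Matryoshka setup.

```python
# Sketch: gradient x activation attribution over a Matryoshka-style "core" prefix,
# followed by selection of the smallest subset covering a fixed attribution fraction.
# ToySAE and the surrogate loss are hypothetical stand-ins for the real pipeline.
import torch
import torch.nn as nn

class ToySAE(nn.Module):
    """Stand-in SAE whose first `core_size` latents form the most nested dictionary."""
    def __init__(self, d_model=64, n_latents=256, core_size=32):
        super().__init__()
        self.core_size = core_size
        self.W_enc = nn.Parameter(torch.randn(d_model, n_latents) * 0.02)
        self.W_dec = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)

    def encode(self, x):
        return torch.relu(x @ self.W_enc)

    def decode_core(self, z):
        # Most nested reconstruction: only the first `core_size` latents are used.
        return z[..., : self.core_size] @ self.W_dec[: self.core_size]

def gradient_x_activation(sae, acts, loss_fn):
    """Attribution of each core latent to the downstream loss via grad * activation."""
    z = sae.encode(acts)
    z.retain_grad()                              # non-leaf tensor; keep its gradient
    recon = sae.decode_core(z)
    loss = loss_fn(recon)                        # stand-in for next-token loss with recon patched in
    loss.backward()
    attrib = (z.grad * z)[..., : sae.core_size]  # gradient x activation, core latents only
    return attrib.abs().sum(dim=(0, 1))          # aggregate over batch and sequence

def smallest_covering_subset(attrib, fraction=0.9):
    """Indices of the fewest latents whose attribution mass reaches `fraction` of the total."""
    order = torch.argsort(attrib, descending=True)
    cum = torch.cumsum(attrib[order], dim=0)
    k = int(torch.searchsorted(cum, fraction * attrib.sum()).item()) + 1
    return order[:k]

if __name__ == "__main__":
    torch.manual_seed(0)
    sae = ToySAE()
    acts = torch.randn(8, 16, 64)                # (batch, seq, d_model) residual-stream activations
    loss_fn = lambda recon: recon.pow(2).mean()  # surrogate for the patched next-token loss
    attrib = gradient_x_activation(sae, acts, loss_fn)
    core = smallest_covering_subset(attrib, fraction=0.9)
    print(f"kept {core.numel()} of {sae.core_size} core latents")
```

In the full pipeline, the encoder weight vectors of the selected latents would be carried into the next cycle's shared core, while the core decoder and all non-core latents are reinitialized, as described above.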