Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training and interpreting SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
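To make the factorized encoder concrete, the following is a minimal, hypothetical PyTorch sketch of the idea described above. The head sizes `m` and `n`, the class and function names, and the particular mAND surrogate used here (the elementwise minimum of ReLU-rectified inputs) are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a Kronecker-factorized SAE encoder (not the paper's exact code).
# Assumptions: dictionary size F = m * n, and mAND(a, b) = min(relu(a), relu(b)) as one
# plausible differentiable surrogate for binary AND.
import torch
import torch.nn as nn


def mand(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Differentiable stand-in for binary AND: nonzero only when both inputs are active."""
    return torch.minimum(torch.relu(a), torch.relu(b))


class KronEncoder(nn.Module):
    """Encoder whose F = m * n latents are composed from two small projections.

    Instead of a dense d_model -> F map, we learn d_model -> m and d_model -> n
    maps and form every (i, j) pair, so encoder parameters scale with m + n rather
    than with m * n.
    """

    def __init__(self, d_model: int, m: int, n: int):
        super().__init__()
        self.proj_a = nn.Linear(d_model, m)
        self.proj_b = nn.Linear(d_model, n)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.proj_a(x)                           # (..., m)
        b = self.proj_b(x)                           # (..., n)
        # Pairwise composition: latent (i, j) fires only if both factors fire.
        z = mand(a.unsqueeze(-1), b.unsqueeze(-2))   # (..., m, n)
        return z.flatten(-2)                         # (..., m * n)


if __name__ == "__main__":
    enc = KronEncoder(d_model=512, m=64, n=64)       # 4096 latents from two 64-dim projections
    h = torch.randn(8, 512)
    print(enc(h).shape)                              # torch.Size([8, 4096])
```

Under these assumptions, the expensive dense map into the full dictionary is replaced by two narrow projections whose outputs are combined pairwise, which is where the memory and compute savings in the encoder would come from.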