Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability through an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos language models (B-cos LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion with task fine-tuning, improving efficiency compared to previous methods. Automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human-interpretable explanations than post-hoc methods, while maintaining task performance comparable to that of conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models into B-cos LMs for generation tasks. Our code is available at https://github.com/Ewanwong/bcos_lm.
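For readers unfamiliar with B-cos networks, the following is a minimal PyTorch sketch of the underlying B-cos transform referenced above: a bias-free linear layer whose output is scaled by the input-weight alignment. It assumes the standard formulation of B-cos layers from the computer vision literature (Böhle et al.); the exact conversion used for B-cos LMs in this work may differ, and the layer name and the exponent hyperparameter `b` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BcosLinear(nn.Module):
    """Illustrative B-cos linear layer (a sketch, not the paper's implementation).

    Replaces y = Wx + b with a bias-free transform whose output is scaled by
    |cos(x, w)|^(B-1), so large activations require strong input-weight alignment.
    """

    def __init__(self, in_features: int, out_features: int, b: float = 2.0):
        super().__init__()
        self.b = b  # alignment exponent "B"; B = 1 recovers a plain bias-free linear layer
        # No bias term: B-cos networks remove biases so the computation stays
        # input-linear and can be summarized faithfully per input feature.
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize each weight vector so w^T x = ||x|| * cos(x, w).
        w = F.normalize(self.weight, dim=-1)
        linear = F.linear(x, w)
        cos = linear / (x.norm(dim=-1, keepdim=True) + 1e-6)
        # Down-weight outputs where input and weight are poorly aligned.
        return linear * cos.abs().pow(self.b - 1)
```

In a conversion pipeline along the lines described above, such a layer would replace the standard linear projections of a pre-trained LM before task fine-tuning; the details of which modules are converted are specific to the paper and not shown here.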