Large Language Models (LLMs) such as GPT, LLaMA, and Claude achieve remarkable performance in text generation but remain opaque in their decision-making, limiting trust and accountability in high-stakes applications. We present gSMILE (generative SMILE), a model-agnostic, perturbation-based framework for token-level interpretability in LLMs. Extending the SMILE methodology, gSMILE applies controlled prompt perturbations, measures output shifts with Wasserstein distances, and fits weighted linear surrogates to identify the input tokens with the greatest impact on the output. The framework then produces intuitive heatmaps that visually highlight influential tokens and reasoning paths. We evaluate gSMILE across leading LLMs (OpenAI's gpt-3.5-turbo-instruct, Meta's LLaMA 3.1 Instruct Turbo, and Anthropic's Claude 2.1) on attribution fidelity, consistency, stability, faithfulness, and accuracy. Results show that gSMILE delivers reliable, human-aligned attributions, with Claude 2.1 excelling in attribution fidelity and GPT-3.5 achieving the highest output consistency. These findings demonstrate that gSMILE balances model performance and interpretability, enabling more transparent and trustworthy AI systems.
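To make the pipeline concrete, the following is a minimal, self-contained sketch of a gSMILE-style attribution loop, reconstructed only from the description above (mask-based prompt perturbations, a Wasserstein distance between the original and perturbed output distributions, and a locality-weighted linear surrogate whose coefficients serve as token attributions). The `toy_model` stand-in, the `gsmile_attributions` name, the exponential proximity kernel, and the ridge surrogate are illustrative assumptions, not the authors' implementation; in practice the model call would be replaced with real LLM queries (e.g., next-token probabilities from gpt-3.5-turbo-instruct).

```python
# Illustrative sketch only: a gSMILE-style perturb -> compare -> fit loop.
# All names here (toy_model, gsmile_attributions) are assumptions, not the
# paper's actual implementation.
import zlib
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.linear_model import Ridge

VOCAB = 50  # size of the toy output distribution's support


def toy_model(prompt_tokens):
    """Stand-in for an LLM query: returns a fake next-token probability
    vector, deterministic in the prompt. Replace with real API calls."""
    seed = zlib.crc32(" ".join(prompt_tokens).encode())
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(VOCAB))


def gsmile_attributions(tokens, n_samples=200, kernel_width=0.25, seed=0):
    rng = np.random.default_rng(seed)
    support = np.arange(VOCAB)
    base_out = toy_model(tokens)

    masks, dists = [], []
    for _ in range(n_samples):
        mask = rng.integers(0, 2, size=len(tokens))        # 1 = keep token
        kept = [t for t, m in zip(tokens, mask) if m]
        out = toy_model(kept if kept else ["<empty>"])
        masks.append(mask)
        # 1-Wasserstein distance between base and perturbed output dists.
        dists.append(wasserstein_distance(support, support,
                                          u_weights=base_out, v_weights=out))

    X = np.asarray(masks, dtype=float)
    y = np.asarray(dists)
    # Locality kernel: perturbations that keep more of the original prompt
    # get more weight, mirroring LIME/SMILE-style proximity weighting.
    weights = np.exp(-((1.0 - X.mean(axis=1)) ** 2) / kernel_width)

    surrogate = Ridge(alpha=1.0)
    surrogate.fit(X, y, sample_weight=weights)
    # A large coefficient means removing that token moves the output
    # distribution a lot, i.e., the token is highly influential.
    return dict(zip(tokens, surrogate.coef_))


if __name__ == "__main__":
    for tok, score in gsmile_attributions("Why is the sky blue".split()).items():
        print(f"{tok:>5s}: {score:+.4f}")
```

In the full framework these per-token scores would be rendered as a heatmap over the prompt; the printed signed coefficients play that role in this sketch.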