Most current explainability techniques focus on capturing the importance of features in the input space. However, given the complexity of models and of data-generating processes, the resulting explanations are far from `complete', in that they lack an indication of feature interactions and a visualization of their `effect'. In this work, we propose a novel twin-surrogate explainability framework to explain the decisions made by any CNN-based image classifier, irrespective of its architecture. To this end, we first disentangle latent features from the classifier and then align these features with observed/human-defined `context' features. These aligned features form semantically meaningful concepts that are used to extract a causal graph depicting the `perceived' data-generating process, i.e., the inter- and intra-feature interactions between unobserved latent features and observed `context' features. This causal graph serves as a global model from which local explanations of different forms can be extracted. Specifically, we provide a generator to visualize the `effect' of interactions among features in latent space, and derive feature importance from it as local explanations. Our framework uses adversarial knowledge distillation to faithfully learn a representation of the classifier's latent space and uses it to extract visual explanations. We use the StyleGAN2 architecture with an additional regularization term to enforce disentanglement and alignment. We demonstrate and evaluate the explanations obtained with our framework on Morpho-MNIST and on the FFHQ human-faces dataset. Our framework is available at \url{https://github.com/koriavinash1/GLANCE-Explanations}.