Vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful efforts to interpret transformers on sequential data, little effort has been devoted to interpreting ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. Specifically, we first identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Then, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies on multiple ViTs with experienced deep learning experts, we validate the effectiveness of our solution, which deepens the understanding of ViTs from the perspectives of head importance, head attention strength, and head attention pattern.
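To make the question "how strongly do patches attend to their spatial neighbors?" concrete, the sketch below computes a mean attention distance per head: the grid distance between a query patch and the patches it attends to, weighted by attention strength. This is a minimal illustrative sketch, not the paper's implementation; the tensor shapes, grid size, and function name are assumptions.

```python
import numpy as np

def mean_attention_distance(attn, grid_size):
    """Attention-weighted spatial distance per head (hypothetical helper).

    attn: array of shape (num_heads, num_patches, num_patches); each row is
          softmax-normalized attention from one query patch to all patches.
    grid_size: side length of the patch grid (num_patches == grid_size**2).
    Returns an array of shape (num_heads,).
    """
    num_heads, num_patches, _ = attn.shape
    # (row, col) grid coordinate of every patch index.
    coords = np.stack(np.divmod(np.arange(num_patches), grid_size), axis=1)
    # Pairwise Euclidean distances between patch positions, shape (P, P).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Weight each distance by attention, sum over keys, average over queries.
    return (attn * dists[None]).sum(axis=-1).mean(axis=-1)

# Toy usage: random attention for a 12-head layer on a 14x14 patch grid.
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 14 * 14, 14 * 14))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(mean_attention_distance(attn, grid_size=14))
```

Under this reading, a head with a small mean attention distance attends mostly to spatial neighbors (a "local" head), while a large value indicates globally distributed attention; tracking this statistic across layers corresponds to the trend profiling described above.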