Scaled dot-product attention applies a softmax function to the scaled dot product of queries and keys to compute attention weights, and then multiplies the weights with the values. In this work, we study how to improve the learning of scaled dot-product attention so as to improve the accuracy of DETR. Our method is based on the following observations: using a ground-truth foreground-background mask (GT Fg-Bg Mask) as an additional cue when learning the weights/values enables much better weights/values to be learned; and with better weights/values, better values/weights can be learned in turn. We propose a triple-attention module in which the first attention is a plain scaled dot-product attention, while the second/third attention generates high-quality weights/values (with the assistance of the GT Fg-Bg Mask) and shares its values/weights with the first attention to improve the quality of the latter's values/weights. The second and third attentions are removed during inference. We call our method knowledge-sharing DETR (KS-DETR), which extends knowledge distillation (KD) in that the improved weights and values of the teachers (the second and third attentions) are directly shared, rather than mimicked, by the student (the first attention), enabling more efficient knowledge transfer from the teachers to the student. Experiments on various DETR-like methods show consistent improvements over the baseline methods on the MS COCO benchmark. Code is available at https://github.com/edocanonymous/KS-DETR.
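To make the sharing scheme concrete, below is a minimal PyTorch-style sketch of one possible reading of the triple-attention module. The class name, the extra teacher projections, and the way the GT Fg-Bg Mask is injected (added to the input as a mask embedding) are illustrative assumptions, not the paper's actual implementation; only the student branch is kept at inference, matching the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TripleAttentionSketch(nn.Module):
    """Hypothetical sketch of the KS-DETR triple-attention idea.

    Attention 1 (student): plain scaled dot-product attention, kept at inference.
    Attention 2 (teacher): mask-assisted weights acting on the shared student values.
    Attention 3 (teacher): shared student weights acting on mask-assisted values.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Hypothetical extra projections for the mask-assisted teacher branches.
        self.k_proj_t = nn.Linear(d_model, d_model)
        self.v_proj_t = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x, fg_bg_mask=None):
        # x: (batch, tokens, d_model); fg_bg_mask: an embedded GT Fg-Bg Mask
        # of the same shape (injection by addition is an assumption).
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        w = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # student weights
        out_student = w @ v

        if self.training and fg_bg_mask is not None:
            # Teacher 2: improved weights, shared (student) values.
            k_t = self.k_proj_t(x + fg_bg_mask)
            w_t = F.softmax(q @ k_t.transpose(-2, -1) * self.scale, dim=-1)
            out_teacher_w = w_t @ v

            # Teacher 3: shared (student) weights, improved values.
            v_t = self.v_proj_t(x + fg_bg_mask)
            out_teacher_v = w @ v_t
            return out_student, out_teacher_w, out_teacher_v

        # Teachers are removed at inference; only the plain attention remains.
        return out_student
```

In this sketch the knowledge transfer happens implicitly through the shared tensors `v` and `w`, which receive gradients from the teacher branches during training, rather than through an explicit mimicking loss as in standard KD.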