Attention modules connecting encoders and decoders have been widely applied in object recognition, image captioning, visual question answering, and neural machine translation, and significantly improve performance. In this paper, we propose a bottom-up gated hierarchical attention (GHA) mechanism for image captioning. Our model employs a CNN as the decoder, which learns different concepts at different layers; these concepts in turn correspond to different regions of an image. We therefore develop GHA, in which low-level concepts are merged into high-level concepts while low-level attended features are simultaneously passed to the top layers to make predictions. GHA significantly improves over a model with only single-level attention: for example, the CIDEr score increases from 0.923 to 0.999, which is comparable to state-of-the-art models that employ attribute boosting and reinforcement learning (RL). We also conduct extensive experiments to analyze the CNN decoder and the proposed GHA, and find that deeper decoders do not yield better performance; moreover, as the convolutional decoder becomes deeper, the model becomes more likely to collapse during training.
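To make the gating idea concrete, the following is a minimal sketch of merging attended features from two decoder levels. This is not the authors' implementation: the dot-product attention scoring and the scalar gate are illustrative assumptions standing in for the learned components described in the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, query):
    # Dot-product attention: weight each image-region feature by its
    # similarity to the query, then return the weighted sum.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, region_features))
            for d in range(dim)]

def gated_hierarchical_attention(low_feats, high_feats, query, gate):
    # Attend separately at the low and high levels, then merge the two
    # attended vectors with a gate in [0, 1] (here a fixed scalar; in the
    # model it would be learned): gate * high + (1 - gate) * low.
    low = attend(low_feats, query)
    high = attend(high_feats, query)
    return [gate * h + (1.0 - gate) * l for h, l in zip(high, low)]
```

With `gate = 1.0` the output reduces to pure high-level attention, and with `gate = 0.0` to pure low-level attention; intermediate values blend the two, which is the bottom-up merging behavior the abstract describes.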