We investigate the behaviour of attention in neural models of visually grounded speech trained on two languages: English and Japanese. Experimental results show that attention focuses on nouns, and that this behaviour holds for two typologically very different languages. We also draw parallels between artificial neural attention and human attention, and show that neural attention focuses on word endings, as has been theorised for human attention. Finally, we investigate how two visually grounded monolingual models can be used to perform cross-lingual speech-to-speech retrieval. For both languages, the enriched bilingual (speech-image) corpora, with part-of-speech tags and forced alignments, are distributed to the community for reproducible research.