Recent works in image captioning have shown very promising raw performance. However, we observe that most of these encoder-decoder style networks with attention do not scale naturally to large vocabulary sizes, making them difficult to deploy on embedded systems with limited hardware resources. This is because the sizes of the word and output embedding matrices grow proportionally with the vocabulary size, adversely affecting the compactness of these networks. To address this limitation, this paper introduces a new idea in the domain of image captioning: we tackle the problem of compactness of image captioning models, which is hitherto unexplored. We show that our proposed model, named COMIC (COMpact Image Captioning), achieves results comparable to state-of-the-art approaches on five common evaluation metrics on both the MS-COCO and InstaPIC-1.1M datasets, despite having an embedding vocabulary size that is 39x to 99x smaller.
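To make the scaling argument concrete, the sketch below (with an assumed embedding dimension of 512 and illustrative vocabulary sizes; neither is taken from the paper) shows how the parameter count of the input word embedding and output projection matrices grows linearly with vocabulary size V:

```python
def embedding_params(vocab_size: int, embed_dim: int = 512) -> int:
    """Parameters in the two vocabulary-dependent matrices of a typical
    encoder-decoder captioner: the V x d input word embedding and the
    d x V output (softmax) projection. Both scale linearly with V."""
    return 2 * vocab_size * embed_dim

# Illustrative vocabulary sizes (assumptions, not figures from the paper):
full = embedding_params(10_000)   # a typical caption vocabulary
small = embedding_params(256)     # a vocabulary ~39x smaller

print(full, small, full / small)  # the parameter count shrinks by the same ~39x factor
```

Because the reduction in these matrices tracks the vocabulary reduction one-to-one, shrinking the embedding vocabulary directly shrinks the model's dominant parameter blocks.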