Image captioning systems have made substantial progress, largely due to the availability of curated datasets such as Microsoft COCO and VizWiz, whose images are paired with accurate descriptions. Unfortunately, such cleanly labeled data is scarce, and models trained on it alone tend to produce captions that are terse and idiosyncratically specific to details in the image. We propose a new technique, cooperative distillation, that combines clean curated datasets with the web-scale, automatically extracted captions of the Google Conceptual Captions (GCC) dataset; GCC captions can describe their images poorly, but the dataset is abundant, and it therefore provides a rich vocabulary that results in more expressive captions.