Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in stark contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker than reference-based metrics, e.g., news captions that require richer contextual knowledge.
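For concreteness, here is a sketch of how such a compatibility score can be formed from CLIP embeddings; the notation is ours, not fixed by the abstract. Writing $\mathbf{c}$ for the CLIP embedding of the candidate caption, $\mathbf{v}$ for that of the image, $R$ for a set of reference-caption embeddings, and $w > 0$ for a rescaling constant, the reference-free score is a clipped cosine similarity,
\[
\mathrm{CLIPScore}(\mathbf{c}, \mathbf{v}) = w \cdot \max\bigl(\cos(\mathbf{c}, \mathbf{v}),\, 0\bigr),
\]
and the reference-augmented variant can combine this with text-text reference similarity via a harmonic mean,
\[
\mathrm{RefCLIPScore}(\mathbf{c}, R, \mathbf{v}) = \operatorname{H\text{-}mean}\Bigl(\mathrm{CLIPScore}(\mathbf{c}, \mathbf{v}),\; \max\bigl(\max_{\mathbf{r} \in R} \cos(\mathbf{c}, \mathbf{r}),\, 0\bigr)\Bigr).
\]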