Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
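The abstract does not give the exact form of the ATT loss, but the idea of a test-time objective with a Gaussian prior that admits a closed-form minimizer can be illustrated with a toy sketch. Everything below is an assumption for illustration: the quadratic loss form, the function name `adaptive_score`, and the parameters `mu`, `sigma`, and `lam` are hypothetical and not taken from the paper.

```python
import numpy as np

def adaptive_score(probs, bins, mu=3.0, sigma=1.0, lam=0.5):
    """Hypothetical sketch of a test-time adaptive score (not the paper's ATT loss).

    Minimizes L(y) = sum_k p_k * (y - s_k)^2 + lam * (y - mu)^2 / sigma^2,
    where p_k is the model's probability of score bin s_k and N(mu, sigma^2)
    is an assumed Gaussian prior over scores. Setting dL/dy = 0 gives the
    closed-form minimizer: a precision-weighted average of the expected
    score under the model and the prior mean.
    """
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()                    # normalize to a distribution
    expected = float(probs @ np.asarray(bins, dtype=float))
    w = lam / sigma**2                             # prior weight (precision * lam)
    return (expected + w * mu) / (1.0 + w)
```

With `lam = 0` the prior is ignored and the estimate reduces to the expected score under the model; as `lam` grows, the score is pulled toward the prior mean `mu`, which mirrors how an analytical solution avoids any iterative optimization at test time.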