DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
