Despite the remarkable reasoning abilities of large vision-language models (LVLMs), their robustness under visual corruptions remains insufficiently studied. Existing evaluation paradigms exhibit two major limitations: 1) the dominance of low-discriminative samples in current datasets masks the real robustness gap between models; and 2) conventional accuracy-based metrics fail to capture the degradation of the underlying prediction structure. To bridge these gaps, we introduce Bench-C, a comprehensive benchmark emphasizing discriminative samples for assessing corruption robustness, built with a selection strategy that jointly considers prediction inconsistency under corruption and semantic diversity. Furthermore, we propose the Robustness Alignment Score (RAS), a unified metric that measures degradation of the logit-level prediction structure by accounting for shifts in prediction uncertainty and calibration alignment. Comprehensive experiments and analysis reveal several interesting findings: 1) models exhibit distinct behavioral patterns under corruption, such as erroneous confidence and hesitation; 2) although subtle corruption may yield a slight accuracy gain, the overall prediction structure still degrades; and 3) decomposing corruption robustness into destructive and corrective components reveals distinct failure and recovery patterns across models.