Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs): they achieve up to 34% higher accuracy on famous buildings than on ordinary ones, which indicates a reliance on memorization over generalizable understanding. To investigate this systematically, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001–2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame construction year prediction as an ordinal regression task and introduce popularity-aware interval accuracy metrics to quantify the bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/

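To make the idea of a popularity-aware interval accuracy concrete, the following is a minimal sketch, not the benchmark's actual implementation. It assumes that interval accuracy means the fraction of predictions falling within a fixed number of years of the true construction year, and that popularity is split into two groups by a page-view threshold. The names `Sample`, `interval_accuracy`, `popularity_gap`, and the parameters `tolerance` and `view_threshold` are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Sample:
    """One building: ground-truth year, predicted year, page-view count."""
    true_year: int
    pred_year: float
    page_views: int


def interval_accuracy(samples: Sequence[Sample], tolerance: int) -> float:
    """Fraction of predictions within +/- `tolerance` years of the truth.

    Assumed definition; the paper's exact metric may differ.
    """
    if not samples:
        return float("nan")
    hits = sum(abs(s.pred_year - s.true_year) <= tolerance for s in samples)
    return hits / len(samples)


def popularity_gap(
    samples: Sequence[Sample], tolerance: int, view_threshold: int
) -> Dict[str, float]:
    """Compare interval accuracy on popular vs. ordinary buildings.

    `view_threshold` is a hypothetical cut-off on page views used to
    separate the two groups.
    """
    popular = [s for s in samples if s.page_views >= view_threshold]
    ordinary = [s for s in samples if s.page_views < view_threshold]
    acc_popular = interval_accuracy(popular, tolerance)
    acc_ordinary = interval_accuracy(ordinary, tolerance)
    return {
        "popular": acc_popular,
        "ordinary": acc_ordinary,
        "gap": acc_popular - acc_ordinary,
    }
```

Under these assumptions, a large positive `gap` would mean a model is noticeably more accurate on heavily viewed buildings than on rarely viewed ones, which is the kind of disparity the abstract describes.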