Evaluating multimodal large language models (MLLMs) is fundamentally challenged by the absence of structured, interpretable, and theoretically grounded benchmarks; current heuristically grouped tasks suffer from vague cognitive targets, overlapping abilities, redundant indicators, and weak diagnostic power. We therefore propose a framework aligned with structural equation modeling (SEM) that quantifies internal validity, dimensional separability, and component contributions, and we introduce a Piaget-inspired capability hierarchy that stratifies MLLM abilities into Perception, Memory, and Reasoning. Reorganizing existing tasks under this theory, we build the GOLD benchmark; experiments show that it offers superior interpretability, lower indicator redundancy, and clearer cognitive consistency than prior benchmarks.
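To make the SEM-aligned analysis concrete, the sketch below shows one way such a check could be run as a confirmatory factor analysis over per-model task scores: fit a three-factor (Perception/Memory/Reasoning) measurement model and read off loadings, latent correlations, and fit indices. The `semopy` package, the task column names, and the file path are illustrative assumptions, not the framework's actual implementation or task list.

```python
# Minimal confirmatory-factor-analysis sketch of the kind of SEM check
# described above. Column names, file path, and the three-factor layout
# are hypothetical placeholders.
import pandas as pd
import semopy  # one possible SEM toolkit; any SEM library would do

# Per-model scores on individual tasks (rows = MLLMs, columns = tasks).
scores = pd.read_csv("task_scores.csv")  # placeholder file name

# Hypothesized measurement model: each task loads on one latent ability.
desc = """
Perception =~ ocr + object_recognition + scene_parsing
Memory     =~ knowledge_recall + commonsense_recall
Reasoning  =~ math_reasoning + logical_reasoning + spatial_reasoning
"""

model = semopy.Model(desc)
model.fit(scores)

# Factor loadings and latent covariances: weak or cross-loading tasks
# suggest redundant indicators; very high latent correlations suggest
# poor dimensional separability.
print(model.inspect())

# Global fit indices (CFI, RMSEA, ...) indicate how well the hypothesized
# Perception/Memory/Reasoning structure explains the observed scores,
# i.e., a proxy for internal validity.
print(semopy.calc_stats(model))
```

In this reading, indicator redundancy would surface as near-duplicate loading patterns, and cognitive consistency as a clean simple structure in which each task loads strongly on exactly one latent ability.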