Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack a neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground-truth images. At the structural level, we propose a hierarchical suite of segmentation-based metrics spanning foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. At the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. Within this unified framework, we benchmark a diverse set of visual decoding methods across multiple stimulus–neuroimaging datasets. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for evaluating brain visual decoding methods.
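To make the structural level concrete, the sketch below shows one plausible way to score decoded images against ground truth with mask overlap at several granularities. It is a minimal illustration, not the paper's actual metric: the function names, the dictionary layout keyed by granularity level, and the assumption that masks at each level are already matched one-to-one between decoded and ground-truth images are all assumptions made here for exposition.

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

def multigranular_structural_score(decoded_masks: dict, gt_masks: dict) -> dict:
    """Average mask IoU per granularity level (hypothetical helper).

    Both arguments map a level name ("foreground", "semantic", "instance",
    "component") to a list of binary masks, assumed to be pre-aligned
    one-to-one between the decoded and ground-truth images.
    """
    scores = {}
    for level in ("foreground", "semantic", "instance", "component"):
        pairs = zip(decoded_masks.get(level, []), gt_masks.get(level, []))
        ious = [mask_iou(p, g) for p, g in pairs]
        scores[level] = float(np.mean(ious)) if ious else float("nan")
    return scores
```

Reporting a separate score per level, rather than a single pooled number, is what lets a multigranular protocol expose where a decoder succeeds (e.g., coarse foreground layout) while still failing at finer structure (e.g., object components); the real framework's granularity-aware correspondence is more involved than the fixed one-to-one pairing assumed above.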