Large Multimodal Models (LMMs) encode rich factual knowledge through cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs and thus inadequately evaluate LMMs' ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along six key dimensions (cognition, awareness, trustworthiness, understanding, reasoning, and robustness) across 11 challenging tasks. MINED is constructed from Wikipedia by two professional annotators and contains 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack temporal understanding ability. Moreover, LMMs perform best on organization knowledge and worst on sport knowledge. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via these methods in single-editing scenarios.