We introduce two new benchmarks, REST and REST+ (Render-Equivalence Stress Tests), to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in a shared embedding space, yet they cannot perform the same tasks equally well in both modalities. Our benchmarks contain samples that convey the same semantic information in three modalities (image, text, mixed), and we show that state-of-the-art MLLMs cannot reason consistently across these modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for text-recognition (OCR) errors. Neither rendering text as an image nor rendering an image as text resolves the inconsistency. Even when OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens affect model performance. Finally, we find that our consistency score correlates with the modality gap between text and image representations, pointing to a mechanistic interpretation of cross-modal inconsistency in MLLMs.
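For concreteness, the sketch below illustrates one plausible way to compute the two quantities named in the abstract: a cross-modal consistency score (agreement of a model's answers across the image, text, and mixed renderings of the same sample) and a modality gap (distance between the centroids of normalized image and text embeddings). The function names, metric definitions, and toy data are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: a cross-modal consistency score and a simple
# modality-gap estimate. Definitions are assumptions, not the paper's.
import numpy as np

def consistency_score(answers_by_modality: dict[str, list[str]]) -> float:
    """Fraction of samples for which the model gives the same answer
    across all modalities (e.g. image, text, mixed)."""
    modalities = list(answers_by_modality.values())
    n = len(modalities[0])
    agree = sum(1 for i in range(n)
                if len({m[i] for m in modalities}) == 1)
    return agree / n

def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Euclidean distance between the centroids of L2-normalized
    image and text embeddings (one common modality-gap measure)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Toy usage: three samples, answers per modality, random embeddings.
answers = {
    "image": ["A", "B", "C"],
    "text":  ["A", "B", "D"],
    "mixed": ["A", "B", "C"],
}
print(consistency_score(answers))  # 0.667: samples 1 and 2 agree, sample 3 does not
rng = np.random.default_rng(0)
print(modality_gap(rng.normal(size=(10, 8)), rng.normal(size=(10, 8))))
```

Under these assumed definitions, the reported correlation would mean that models whose image and text embedding centroids sit farther apart also answer less consistently across modalities.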