Semantic retrieval is crucial for modern applications yet remains underexplored in current research. Existing datasets are limited to a single language, a single image, or a single retrieval condition, and often fail to exploit the expressive capacity of visual information, as evidenced by performance that holds up even when images are replaced with captions. Practical retrieval scenarios, however, frequently involve interleaved multi-condition queries with multiple images. Hence, this paper introduces MERIT, the first multilingual dataset for interleaved multi-condition semantic retrieval, comprising 320,000 queries over 135,000 products in 5 languages, covering 7 distinct product categories. Extensive experiments on MERIT identify a key limitation of existing models: they focus solely on global semantic information while neglecting the specific conditional elements in queries. Consequently, we propose Coral, a novel fine-tuning framework that adapts pre-trained MLLMs by integrating embedding reconstruction, to preserve fine-grained conditional elements, with contrastive learning, to extract comprehensive global semantics. Experiments demonstrate that Coral achieves a 45.9% performance improvement over conventional approaches on MERIT, with strong generalization validated across 8 established retrieval benchmarks. Collectively, our contributions (a novel dataset, the identification of critical limitations in existing approaches, and an innovative fine-tuning framework) establish a foundation for future research in interleaved multi-condition semantic retrieval.
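The training objective described above, combining contrastive learning for global semantics with embedding reconstruction for fine-grained conditional elements, can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the InfoNCE form of the contrastive term, the mean-squared-error reconstruction term, the function names, and the weighting hyperparameter `alpha` are all assumptions.

```python
import numpy as np

def contrastive_loss(queries, products, temperature=0.07):
    """InfoNCE-style loss: each query's positive is the same-index product.

    Both inputs are (batch, dim) embedding matrices; rows are L2-normalized
    so logits are scaled cosine similarities.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = products / np.linalg.norm(products, axis=1, keepdims=True)
    logits = q @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def reconstruction_loss(token_embeddings, reconstructed):
    """MSE between the query's input embeddings and their reconstruction,
    encouraging the model to retain fine-grained conditional elements."""
    return np.mean((token_embeddings - reconstructed) ** 2)

def joint_loss(queries, products, token_embeddings, reconstructed, alpha=0.5):
    """Weighted sum of the two objectives; alpha is an assumed hyperparameter."""
    return (contrastive_loss(queries, products)
            + alpha * reconstruction_loss(token_embeddings, reconstructed))
```

In a real setup the query and product embeddings would come from the pooled outputs of the fine-tuned MLLM, and the reconstruction head would predict the input token embeddings from the pooled representation; the sketch only fixes the shape of the combined objective.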