Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose $C^3$, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. $C^3$ introduces a completeness evaluation module to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding consistency evaluation through adaptive query control. Experiments on the cultural heritage datasets CulTi and TimeTravel, as well as on general benchmarks MSCOCO and Flickr30K, demonstrate that $C^3$ achieves state-of-the-art performance in both fine-tuned and zero-shot settings.
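To make the adaptive query control more concrete, below is a minimal, hypothetical sketch of an MDP-style consistency-checking loop. The state, action, and reward definitions (`ConsistencyState`, `verify_claim`, `stop_threshold`) are illustrative assumptions for exposition, not the paper's actual formulation.

```python
# Hypothetical sketch: MDP-guided adaptive query control for consistency checking.
# All names and thresholds here are assumptions, not the $C^3$ implementation.

from dataclasses import dataclass, field

@dataclass
class ConsistencyState:
    description: str                                  # current LLM-enriched description
    unverified_claims: list                           # claims not yet checked against the image
    verified: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

def consistency_loop(state, verify_claim, max_queries=5, accept_threshold=0.9):
    """Greedy policy sketch: keep issuing verification queries until the
    claim queue is empty or the query budget is spent, then stop."""
    queries = 0
    while state.unverified_claims and queries < max_queries:
        claim = state.unverified_claims.pop(0)        # action: choose next claim to verify
        score = verify_claim(state.description, claim)  # e.g. a visual-grounding check, supplied by the caller
        (state.verified if score >= accept_threshold else state.rejected).append(claim)
        queries += 1
    # terminal "reward": fraction of checked claims that survived verification
    total = len(state.verified) + len(state.rejected)
    return state, (len(state.verified) / total if total else 1.0)
```

In this toy formulation, each step of the loop corresponds to one state transition, and the stopping rule (query budget plus an empty claim queue) stands in for the learned or rule-based policy that the abstract refers to as adaptive query control.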