Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but these gains are largely driven by scaling up training data. Existing methods treat image-text pairs as isolated training examples, neglecting the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss that aligns the two modalities while also modeling relationships between neighboring entities in a graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modal supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, demonstrating the value of relational supervision for cross-modal alignment.
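To make the structural contrastive objective concrete, below is a minimal sketch (not the paper's released code) of one plausible formulation: a standard CLIP-style InfoNCE term combined with a structural term that treats graph neighbors within the batch as additional soft positives. The function name `structural_contrastive_loss`, the soft-target construction, and the weighting `lam` are illustrative assumptions, not the authors' exact loss.

```python
# Hypothetical sketch of a structure-aware contrastive loss: CLIP-style alignment
# plus a graph-neighbor term. All names and weights here are assumptions.
import torch
import torch.nn.functional as F


def structural_contrastive_loss(img_emb, txt_emb, adj, temperature=0.07, lam=0.5):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of paired items.
    adj: (N, N) symmetric binary adjacency matrix of the co-purchase graph."""
    logits = img_emb @ txt_emb.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(img_emb.size(0), device=logits.device)

    # Standard symmetric image-text alignment term (as in CLIP).
    clip_loss = 0.5 * (F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets))

    # Structural term: each item's graph neighbors (plus itself) are treated as
    # positives via a soft target distribution over the batch.
    pos_mask = adj.float() + torch.eye(adj.size(0), device=adj.device)
    soft_targets = pos_mask / pos_mask.sum(dim=1, keepdim=True)
    struct_loss = 0.5 * (
        torch.sum(-soft_targets * F.log_softmax(logits, dim=1), dim=1).mean() +
        torch.sum(-soft_targets * F.log_softmax(logits.t(), dim=1), dim=1).mean()
    )
    return (1 - lam) * clip_loss + lam * struct_loss


# Toy usage: random embeddings and a ring-shaped neighbor graph over the batch.
if __name__ == "__main__":
    N, D = 8, 32
    img = F.normalize(torch.randn(N, D), dim=1)
    txt = F.normalize(torch.randn(N, D), dim=1)
    adj = torch.zeros(N, N)
    for i in range(N):
        adj[i, (i + 1) % N] = adj[(i + 1) % N, i] = 1
    print(structural_contrastive_loss(img, txt, adj))
```

Under this reading, setting `lam = 0` recovers the ordinary CLIP objective, while larger values push neighboring products in the co-purchase graph closer together in the shared embedding space.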