Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated significant potential for recommendation systems. However, the effective application of MLLMs to multimodal sequential recommendation remains underexplored: A) Existing methods primarily leverage the multimodal semantic understanding capabilities of pre-trained MLLMs to generate item embeddings or semantic IDs that enhance traditional recommendation models; the resulting item representations offer limited interpretability and are difficult to transfer to language model-based recommendation systems. B) Other approaches convert user behavior sequences into image-text pairs and perform recommendation through multiple rounds of MLLM inference, incurring prohibitive computational and time costs. C) Current MLLM-based recommendation systems generally neglect the integration of collaborative signals. To address these limitations while balancing recommendation performance, interpretability, and computational cost, this paper proposes MMSRARec (MultiModal Summarization-and-Retrieval-Augmented Sequential Recommendation). Specifically, we first employ an MLLM to summarize items into concise keywords and fine-tune the model with rewards that incorporate summary length, information loss, and reconstruction difficulty, enabling adaptive adjustment of the summarization policy. Inspired by retrieval-augmented generation, we then transform collaborative signals into corresponding keywords and integrate them as supplementary context. Finally, we apply supervised fine-tuning with multi-task learning to align the MLLM with the multimodal sequential recommendation task. Extensive evaluations on common recommendation datasets demonstrate the effectiveness of MMSRARec, showcasing its capability to understand user behavior histories and item information efficiently and interpretably for accurate recommendations.