Generative recommendation models often struggle with two key challenges: (1) the superficial integration of collaborative signals, and (2) the decoupled fusion of multimodal features. These limitations hinder the construction of a truly holistic item representation. To address them, we propose CEMG, a novel Collaborative-Enhanced Multimodal Generative Recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual Quantization VAE (RQ-VAE) to convert the fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly outperforms state-of-the-art baselines.
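To make the item-side pipeline concrete, the sketch below illustrates the two stages named in the abstract: collaborative-signal-guided fusion of visual and textual features, followed by RQ-VAE-style residual quantization into discrete semantic codes. This is a minimal illustration, not the authors' implementation; all module names, dimensions, attention design, and the number of quantization levels are assumptions for demonstration.

```python
# Minimal sketch (assumed, not the authors' code) of collaborative-guided
# multimodal fusion followed by residual quantization into discrete codes.
import torch
import torch.nn as nn


class CollaborativeGuidedFusion(nn.Module):
    """Fuse visual and textual features under the guidance of a collaborative embedding."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # The collaborative signal acts as the query; modality features are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, collab, visual, textual):
        # collab, visual, textual: (B, D)
        modalities = torch.stack([visual, textual], dim=1)        # (B, 2, D)
        fused, _ = self.cross_attn(collab.unsqueeze(1), modalities, modalities)
        return self.proj(fused.squeeze(1))                        # (B, D)


class ResidualQuantizer(nn.Module):
    """RQ-style quantization: each level quantizes the residual left by the previous level."""

    def __init__(self, dim: int = 256, num_levels: int = 3, codebook_size: int = 256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_levels)
        )

    def forward(self, z):
        residual, codes = z, []
        for codebook in self.codebooks:
            # Nearest-codeword lookup for the current residual.
            dists = torch.cdist(residual, codebook.weight)        # (B, K)
            idx = dists.argmin(dim=-1)                            # (B,)
            codes.append(idx)
            residual = residual - codebook(idx)
        return torch.stack(codes, dim=-1)                         # (B, num_levels) discrete codes


if __name__ == "__main__":
    B, D = 4, 256
    fusion, rq = CollaborativeGuidedFusion(D), ResidualQuantizer(D)
    fused = fusion(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(rq(fused).shape)  # torch.Size([4, 3]): per-item code sequence the LLM learns to generate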