Ranking relevance is a fundamental task in search engines, aiming to identify the items most relevant to a given user query. Traditional relevance models typically produce scalar scores or directly predict relevance labels, limiting both interpretability and the modeling of complex relevance signals. Inspired by recent advances in Chain-of-Thought (CoT) reasoning for complex tasks, we investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. However, existing reasoning-based Generative Relevance Models (GRMs) primarily rely on supervised fine-tuning on large amounts of human-annotated or synthetic CoT data, which often leads to limited generalization. Moreover, domain-agnostic, free-form reasoning tends to be overly generic and insufficiently grounded, which limits its ability to handle the diverse and ambiguous cases prevalent in open-domain search. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task and introduce a Reinforcement Learning (RL)-based training framework to enhance the grounded reasoning capabilities of GRMs. Specifically, we incorporate practical, business-specific relevance criteria into the multi-step reasoning prompt design and propose Stepwise Advantage Masking (SAM), a lightweight process-supervision strategy that facilitates effective learning of these criteria through improved credit assignment. To enable industrial deployment, we further distill the large-scale RL-tuned model into a lightweight version suitable for real-world search systems. Extensive offline evaluations and online A/B tests demonstrate that our approach consistently delivers significant improvements across key relevance and business metrics, validating its effectiveness, robustness, and practicality for large-scale industrial search systems.
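The abstract names Stepwise Advantage Masking but does not describe its mechanics. To illustrate how masking can sharpen credit assignment in general, here is a minimal sketch under assumptions that are ours, not the paper's: each rollout has one sequence-level advantage, its reasoning is split into token spans (`Step`), and a hypothetical per-step check marks each span correct or incorrect. The assumed rule is that a positive advantage reinforces only correct steps and a negative advantage penalizes only incorrect ones. The names `Step` and `mask_advantages` are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    start: int     # index of the step's first token (inclusive)
    end: int       # index one past the step's last token (exclusive)
    correct: bool  # hypothetical per-step check result (an assumption)


def mask_advantages(advantage: float, num_tokens: int, steps: List[Step]) -> List[float]:
    """Spread one sequence-level advantage over tokens, masking some steps.

    Assumed rule (not taken from the paper): a positive advantage is kept
    only on correct steps, a negative advantage only on incorrect steps.
    Tokens that belong to no step receive zero in this sketch.
    """
    masked = [0.0] * num_tokens
    for step in steps:
        keep = step.correct if advantage > 0 else not step.correct
        if keep:
            for t in range(step.start, step.end):
                masked[t] = advantage
    return masked


# Two reasoning steps over a 6-token rollout; the last token is outside any step.
steps = [Step(0, 3, True), Step(3, 5, False)]
rewarded = mask_advantages(1.0, 6, steps)
penalized = mask_advantages(-0.5, 6, steps)
```

In this toy setup, a rewarded rollout does not reinforce its faulty second step, and a penalized rollout does not punish its sound first step. That is the kind of finer-grained credit assignment the abstract alludes to.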