Accurate inference of user intent is crucial for document retrieval in modern search engines. While large language models (LLMs) have made significant strides in this area, their effectiveness has been assessed predominantly on short, keyword-based queries. As AI-driven search evolves, long-form queries with intricate intents are becoming more prevalent, yet they remain underexplored in LLM-based query understanding (QU). To bridge this gap, we introduce ReDI: a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. ReDI leverages the reasoning and comprehension capabilities of LLMs in a three-stage pipeline: (i) it breaks down a complex query into targeted sub-queries to accurately capture user intent; (ii) it enriches each sub-query with a detailed semantic interpretation to improve query-document matching; and (iii) it retrieves documents for each sub-query independently and employs a fusion strategy to aggregate the results into the final ranking. We compile a large-scale dataset of real-world complex queries from a major search engine and distill the query understanding capabilities of teacher models into smaller models for practical application. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms. We release our code at https://anonymous.4open.science/r/ReDI-6FC7/.
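The three-stage pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released ReDI implementation: the names `redi_style_search`, `Decomposer`, `Interpreter`, and `Retriever` are hypothetical, the decomposition and interpretation stages are passed in as plain callables standing in for LLM calls, and reciprocal rank fusion (RRF) is used only as one common choice of fusion strategy, since the abstract does not say which one ReDI uses.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical type aliases; none of these names come from the ReDI code release.
Decomposer = Callable[[str], List[str]]      # stage (i): query -> sub-queries
Interpreter = Callable[[str], str]           # stage (ii): sub-query -> enriched sub-query
Retriever = Callable[[str, int], List[str]]  # (enriched sub-query, k) -> ranked doc ids


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists with RRF.

    Assumption: RRF stands in for the unspecified fusion strategy.
    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def redi_style_search(
    query: str,
    decompose: Decomposer,
    interpret: Interpreter,
    retrieve: Retriever,
    top_k: int = 10,
) -> List[str]:
    """Run decomposition, interpretation, per-sub-query retrieval, then fusion."""
    sub_queries = decompose(query)                      # stage (i)
    enriched = [interpret(sq) for sq in sub_queries]    # stage (ii)
    rankings = [retrieve(q, top_k) for q in enriched]   # stage (iii), independent retrieval
    return reciprocal_rank_fusion(rankings)[:top_k]     # stage (iii), fusion
```

Because the retriever is just a callable, the same skeleton would accommodate either a sparse or a dense retriever. A document that appears in the results of several sub-queries accumulates score from each list, so it tends to rise in the fused ranking.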