Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs on generative tasks, it remains unclear whether similar redundancy exists when these models are adapted for retrieval, which requires encoding entire sequences into fixed-length representations rather than generating tokens iteratively. To answer this question, we conduct a comprehensive analysis of layer redundancy in LLM-based dense retrievers. We find that, in contrast to generative settings, MLP layers are substantially more prunable, whereas attention layers remain critical for semantic aggregation. Building on this insight, we propose EffiR, a framework for building efficient retrievers that compresses MLP layers at scale through a coarse-to-fine strategy (coarse-grained depth reduction followed by fine-grained width reduction), combined with retrieval-specific fine-tuning. Across diverse BEIR datasets and LLM backbones, EffiR substantially reduces model size and inference cost while preserving the performance of full-size models.
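To make the coarse-to-fine idea concrete, below is a minimal sketch, not the paper's EffiR implementation, of MLP compression on a toy transformer stack: the coarse step drops whole MLP sub-layers from selected blocks while leaving attention intact, and the fine step shrinks the intermediate width of the remaining MLPs. The `Block` definition, the `drop_mlp_depth`/`shrink_mlp_width` helpers, and the magnitude-based neuron selection are illustrative assumptions.

```python
# Minimal PyTorch sketch of coarse-to-fine MLP compression (illustrative only).
# Assumes a toy pre-norm transformer block and a simple magnitude criterion for
# width pruning; Block, drop_mlp_depth, and shrink_mlp_width are made-up names,
# not part of the EffiR codebase.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # mlp[0] is the up-projection, mlp[2] the down-projection
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.mlp_enabled = True  # coarse-grained on/off switch for the MLP

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention always kept
        if self.mlp_enabled:
            x = x + self.mlp(self.norm2(x))
        return x

def drop_mlp_depth(blocks, prune_ids):
    """Coarse step: remove entire MLP sub-layers from the selected blocks."""
    for i in prune_ids:
        blocks[i].mlp_enabled = False

def shrink_mlp_width(block, keep_ratio=0.5):
    """Fine step: keep only the up-projection neurons with the largest L2 norm."""
    up, down = block.mlp[0], block.mlp[2]
    k = max(1, int(up.out_features * keep_ratio))
    idx = up.weight.norm(dim=1).topk(k).indices
    new_up, new_down = nn.Linear(up.in_features, k), nn.Linear(k, down.out_features)
    with torch.no_grad():
        new_up.weight.copy_(up.weight[idx]); new_up.bias.copy_(up.bias[idx])
        new_down.weight.copy_(down.weight[:, idx]); new_down.bias.copy_(down.bias)
    block.mlp[0], block.mlp[2] = new_up, new_down

blocks = nn.ModuleList([Block() for _ in range(6)])
drop_mlp_depth(blocks, prune_ids=[1, 3, 5])      # coarse-grained depth reduction
for b in blocks:
    if b.mlp_enabled:
        shrink_mlp_width(b, keep_ratio=0.5)      # fine-grained width reduction
x = torch.randn(2, 8, 64)                        # (batch, seq_len, d_model)
for b in blocks:
    x = b(x)
print(x.shape)  # torch.Size([2, 8, 64]): compressed stack still encodes sequences
```

In the paper's pipeline, such compression is paired with retrieval-specific fine-tuning so the pruned retriever can recover the accuracy of the full-size model.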