Large language models (LLMs) deliver strong performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can handle simpler tasks efficiently with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable model from a pool of candidates to process diverse inputs, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically treat this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually; this overlooks global budget constraints and leads to ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing task as a constrained optimization problem, assigning queries to models so as to minimize total cost while ensuring the required performance level. Specifically, a hybrid retrieval-augmented predictor is designed to predict the capabilities and costs of LLMs. After obtaining the predicted cost and performance, we apply a constrained optimizer that employs Lagrangian dual decomposition with adaptive multipliers to compute cost-optimal assignments. It iteratively converges toward the globally optimal query-model allocation, dynamically balancing latency minimization against quality thresholds while adhering to heterogeneous capacity constraints. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15% compared to competitive router baselines. The code and the dataset are available at https://github.com/dongyuanjushi/OmniRouter.
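The abstract does not show the optimizer itself, so as a rough illustration only, here is a toy sketch of Lagrangian dual decomposition with adaptive (subgradient-updated) multipliers for query-model assignment. All names, the quality/capacity constraint shapes, and the feasibility bookkeeping are illustrative assumptions, not OmniRouter's actual algorithm:

```python
import numpy as np

def route_queries(cost, quality, q_min, capacity, steps=200, lr=0.5):
    """Toy Lagrangian dual decomposition for query-to-model routing.

    cost[i, j]    : predicted cost of serving query i with model j
    quality[i, j] : predicted quality of model j on query i
    q_min         : required average quality floor
    capacity[j]   : max number of queries model j may serve
    """
    n_q, n_m = cost.shape
    lam = 0.0                       # multiplier for the quality constraint
    mu = np.zeros(n_m)              # multipliers for capacity constraints
    best, best_cost = None, np.inf

    for _ in range(steps):
        # With the multipliers fixed, the relaxed problem decomposes
        # per query: each query picks the model with the lowest
        # cost - lam * quality + mu penalized score.
        score = cost - lam * quality + mu[None, :]
        assign = score.argmin(axis=1)

        # Track the cheapest primal-feasible assignment seen so far.
        avg_q = quality[np.arange(n_q), assign].mean()
        load = np.bincount(assign, minlength=n_m)
        total = cost[np.arange(n_q), assign].sum()
        if avg_q >= q_min and np.all(load <= capacity) and total < best_cost:
            best, best_cost = assign.copy(), total

        # Subgradient ascent on the duals: raise a multiplier when its
        # constraint is violated, relax it (toward zero) otherwise.
        lam = max(0.0, lam + lr * (q_min - avg_q))
        mu = np.maximum(0.0, mu + lr * (load - capacity))

    return best if best is not None else assign

# Hypothetical example: a cheap weak model vs. a costly strong one.
cost = np.tile([1.0, 5.0], (10, 1))
quality = np.tile([0.6, 0.95], (10, 1))
plan = route_queries(cost, quality, q_min=0.9, capacity=np.array([10, 10]))
```

With a low quality floor every query stays on the cheap model; raising `q_min` above the cheap model's quality drives the quality multiplier up until the routing flips, which is the intended balancing behavior of the dual updates.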