Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, best-of-$n$ traditionally relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN routes generations one-by-one across models, based on scores computed from a reward model and an agreement signal over the predicted responses. This online routing requires no additional training, maintains compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model for larger $n$, with gains of up to 3.4\% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference time to improve best-of-$n$ performance beyond any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
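To make the routing idea concrete, the sketch below shows one possible instantiation of sequential routed best-of-$n$ in Python. It is a minimal illustration under assumptions: the routing score (average reward plus a weighted agreement term), the majority-vote agreement heuristic, and the callables `generate_fn`, `reward_fn`, and `answer_fn` are hypothetical placeholders, not the paper's exact procedure.

```python
"""Minimal sketch of routed online best-of-n over a pool of models.

Assumptions (not from the paper): the caller supplies
  generate_fn(model, prompt) -> response,
  reward_fn(prompt, response) -> scalar reward,
  answer_fn(response) -> predicted final answer,
and the routing score is an illustrative reward + agreement combination.
"""
from collections import Counter
from typing import Callable, List


def routed_best_of_n(
    prompt: str,
    models: List[str],
    generate_fn: Callable[[str, str], str],
    reward_fn: Callable[[str, str], float],
    answer_fn: Callable[[str], str],
    n: int,
    alpha: float = 0.5,  # weight on the agreement signal (assumed)
) -> str:
    responses: List[str] = []
    sources: List[str] = []
    for _ in range(n):
        # Route the next generation: score each model by the average reward of
        # its past responses plus how often those responses agree with the
        # current majority answer; unexplored models get a neutral score.
        majority = None
        if responses:
            majority = Counter(answer_fn(r) for r in responses).most_common(1)[0][0]
        scores = []
        for m in models:
            past = [r for r, s in zip(responses, sources) if s == m]
            if not past:
                scores.append(0.0)
                continue
            avg_reward = sum(reward_fn(prompt, r) for r in past) / len(past)
            agreement = sum(answer_fn(r) == majority for r in past) / len(past)
            scores.append(avg_reward + alpha * agreement)
        chosen = models[scores.index(max(scores))]
        responses.append(generate_fn(chosen, prompt))
        sources.append(chosen)
    # Compute parity with single-model best-of-n: exactly n generations total.
    # Final selection is standard best-of-n under the same reward model.
    return max(responses, key=lambda r: reward_fn(prompt, r))
```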