Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, best-of-$n$ traditionally relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN routes generations one-by-one across models, based on scores computed from a reward model and an agreement signal over the predicted responses. This online routing requires no additional training, maintains compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model for larger $n$, with gains of up to 3.4\% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference time to improve best-of-$n$ performance beyond any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
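To make the routing idea concrete, the sketch below shows one possible instantiation of sequential routed best-of-$n$ in Python. It is a minimal illustration under assumptions: the routing score (average reward plus a weighted agreement term), the majority-vote agreement heuristic, and the callables `generate_fn`, `reward_fn`, and `answer_fn` are hypothetical placeholders, not the paper's exact procedure.

```python
"""Minimal sketch of routed online best-of-n over a pool of models.

Assumptions (not from the paper): the caller supplies
  generate_fn(model, prompt) -> response,
  reward_fn(prompt, response) -> scalar reward,
  answer_fn(response) -> predicted final answer,
and the routing score is an illustrative reward + agreement combination.
"""
from collections import Counter
from typing import Callable, List


def routed_best_of_n(
    prompt: str,
    models: List[str],
    generate_fn: Callable[[str, str], str],
    reward_fn: Callable[[str, str], float],
    answer_fn: Callable[[str], str],
    n: int,
    alpha: float = 0.5,  # weight on the agreement signal (assumed)
) -> str:
    responses: List[str] = []
    sources: List[str] = []
    for _ in range(n):
        # Route the next generation: score each model by the average reward of
        # its past responses plus how often those responses agree with the
        # current majority answer; unexplored models get a neutral score.
        majority = None
        if responses:
            majority = Counter(answer_fn(r) for r in responses).most_common(1)[0][0]
        scores = []
        for m in models:
            past = [r for r, s in zip(responses, sources) if s == m]
            if not past:
                scores.append(0.0)
                continue
            avg_reward = sum(reward_fn(prompt, r) for r in past) / len(past)
            agreement = sum(answer_fn(r) == majority for r in past) / len(past)
            scores.append(avg_reward + alpha * agreement)
        chosen = models[scores.index(max(scores))]
        responses.append(generate_fn(chosen, prompt))
        sources.append(chosen)
    # Compute parity with single-model best-of-n: exactly n generations total.
    # Final selection is standard best-of-n under the same reward model.
    return max(responses, key=lambda r: reward_fn(prompt, r))
```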