Verifiers can improve language model capabilities by scoring and ranking generated candidate responses. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find that weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce this dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines their outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and the presence of low-quality verifiers; Weaver addresses these by using dataset statistics to normalize outputs and filter out poorly performing verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and a verifier selects one. Our evaluations show that Weaver significantly improves over Pass@1 (the accuracy of the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as the generator and an ensemble of judge and reward models of 70B parameters or smaller as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce the computational cost of verifier ensembles, we train a compact 400M cross-encoder on Weaver's combined output scores.
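To make the core idea concrete, the sketch below shows one simple way to estimate per-verifier accuracies from unlabeled binary votes and combine them into a single posterior score for selecting among repeated samples. This is an illustrative stand-in, not the paper's implementation: it uses a Dawid-Skene-style naive Bayes model fit with EM (assuming verifiers are conditionally independent given correctness), and the function name `fit_verifier_accuracies`, the accuracy clamp, and the toy data are all assumptions.

```python
import numpy as np

def fit_verifier_accuracies(votes, n_iters=50, prior=0.5):
    """Estimate verifier accuracies and P(correct) per response without labels.

    votes: (n_responses, n_verifiers) array of 0/1 verifier votes.
    Returns (per-verifier accuracy estimates, posterior P(correct) per response).
    """
    votes = np.asarray(votes, dtype=float)
    n, m = votes.shape
    acc = np.full(m, 0.7)  # initial accuracy guess for every verifier
    for _ in range(n_iters):
        # E-step: posterior log-odds that each response is correct; each
        # verifier's vote is weighted by its estimated log-odds of accuracy.
        log_odds = np.log(prior / (1 - prior)) + (2 * votes - 1) @ np.log(acc / (1 - acc))
        post = 1.0 / (1.0 + np.exp(-log_odds))
        # M-step: a verifier's accuracy is its expected agreement with the
        # current posterior labels.
        agree = votes * post[:, None] + (1 - votes) * (1 - post[:, None])
        # Clamp away from 0.5 so no verifier's weight flips sign; a crude
        # stand-in for Weaver's filtering of low-quality verifiers.
        acc = np.clip(agree.mean(axis=0), 0.55, 0.99)
    return acc, post

# Toy usage: 5 candidate responses scored by 4 weak verifiers (1 = "looks correct").
votes = [[1, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 0, 1, 1],
         [0, 0, 0, 1],
         [1, 1, 1, 1]]
acc, p_correct = fit_verifier_accuracies(votes)
best = int(np.argmax(p_correct))  # select the candidate with the highest score
```

Because each vote enters the combined score weighted by its verifier's estimated log-odds of accuracy, more reliable verifiers contribute more, which is why a weighted ensemble can outperform an unweighted majority vote when verifier accuracies differ.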