Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, because they over-optimize what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods such as best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). As a practical solution for aligning language models at test time with additional computation and without degradation, our approach expands the capabilities that can be obtained from off-the-shelf language models without further training.
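To make the notion of the "optimal aligned distribution" concrete: under the standard KL-regularized alignment objective (an assumption here, since the abstract does not spell it out), the per-prompt target that test-time alignment aims to sample from is

\[
\pi_\beta(y \mid x) \;\propto\; p(y \mid x)\,\exp\!\big(r(x, y)/\beta\big),
\]

where \(p\) is the base model, \(r\) is the reward model, and \(\beta\) trades off reward against staying close to the base model. The sketch below illustrates one way MCMC can sample from this target using only generations and reward scores, with no logit access: an independence Metropolis-Hastings chain whose proposal is the base model itself, so the \(p(y \mid x)\) terms cancel in the acceptance ratio. This is a simplified illustration rather than QAlign's exact proposal mechanism; `generate`, `reward`, `beta`, and `num_steps` are hypothetical stand-ins.

```python
import math
import random

def tilted_mh_sampler(prompt, generate, reward, beta=1.0, num_steps=128, rng=random):
    """Independence Metropolis-Hastings targeting pi(y|x) proportional to p(y|x) * exp(r(x, y) / beta).

    Because the proposal distribution is the base model p(y|x) itself, the base-model
    probabilities cancel in the MH acceptance ratio, so only reward scores are needed.
    """
    current = generate(prompt)                # initial response drawn from the base model
    current_reward = reward(prompt, current)
    chain = []
    for _ in range(num_steps):
        candidate = generate(prompt)          # fresh proposal from the base model
        candidate_reward = reward(prompt, candidate)
        # Acceptance probability: min(1, exp((r_candidate - r_current) / beta)).
        accept_prob = min(1.0, math.exp((candidate_reward - current_reward) / beta))
        if rng.random() < accept_prob:
            current, current_reward = candidate, candidate_reward
        chain.append(current)
    return chain  # e.g., take the final state, or aggregate the chain by (weighted) majority vote
```

As compute scales (more MH steps), the chain's distribution approaches \(\pi_\beta\) rather than collapsing onto the single output most favored by the reward model, which is the sense in which this family of samplers avoids the over-optimization that degrades pure reward search as compute grows.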