Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes typically lie deep in the sampling tree. Consequently, common strategies for encouraging diversity, such as temperature scaling, face a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem and incorporate a set-based global loss into Supervised Fine-Tuning (SFT), using self-supervised bipartite matching between global forking tokens and unique reasoning traces. We observe that, while naive fine-tuning on multiple reasoning traces collapses their unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show that SSFT consistently outperforms SFT on both Pass@1 and Cons@k.
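To make the set-based loss concrete, below is a minimal PyTorch sketch of bipartite matching between forking tokens and reasoning traces. It assumes a toy setup with K reserved forking tokens and K reference traces per problem; the names `set_matching_loss` and `trace_nll` are illustrative, not from the paper, and the Hungarian matching is done with `scipy.optimize.linear_sum_assignment` as one plausible realization of the self-supervised matching step.

```python
import torch
from scipy.optimize import linear_sum_assignment

def set_matching_loss(trace_nll: torch.Tensor) -> torch.Tensor:
    # trace_nll[i, j]: negative log-likelihood of reference trace j when the
    # response is forced to begin with global forking token i.
    # Hungarian matching finds the one-to-one token-to-trace assignment with
    # minimal total NLL; the matching itself is not differentiated through.
    cost = trace_nll.detach().cpu().numpy()
    rows, cols = linear_sum_assignment(cost)
    # The set loss sums the NLLs of the matched pairs only, so each forking
    # token specializes in the trace it already explains best, rather than
    # averaging all traces into one collapsed mode.
    return trace_nll[torch.as_tensor(rows), torch.as_tensor(cols)].sum()

# Toy usage: K = 3 forking tokens and 3 traces, with random NLLs standing in
# for a real forward pass over tokenized reasoning traces.
nll = torch.rand(3, 3, requires_grad=True)
loss = set_matching_loss(nll)
loss.backward()
```

The key design choice this illustrates is that gradients flow only through the matched (token, trace) pairs, which is what allows distinct reasoning modes to survive fine-tuning instead of collapsing.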