Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce \textbf{V}ariance-\textbf{C}ontrolled \textbf{O}ptimization-based \textbf{RE}weighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks with models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The code will be released at https://github.com/coder-gx/VCORE.