Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, which can be categorized into progressive reasoning (the essential solution-development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, functional elements substantially increase computational demands at test time. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step by its impact on answer-prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning, yielding optimized training data that keeps the core solution path intact while reducing verbosity. Models fine-tuned on PIR-optimized data exhibit superior test-time scaling properties: they generate more concise reasoning chains while improving accuracy (+0.9\% to +6.6\%) and significantly reducing token usage (-3\% to -41\%) on challenging reasoning benchmarks (AIME, AMC, and GPQA Diamond). Our approach generalizes across model sizes, data sources, and token budgets, offering a practical recipe for deploying reasoning-capable LLMs in settings where response time and computational efficiency are critical.
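To make the step-scoring idea concrete, the following is a minimal sketch of one way a perplexity-based step-importance score could be computed with a HuggingFace causal LM. The model name, the step segmentation, and the leave-one-out formulation (a step's importance is the rise in answer perplexity when that step is ablated) are illustrative assumptions, not necessarily the paper's exact procedure.
\begin{verbatim}
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choice of scoring model; any causal LM works here.
MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME,
                                             torch_dtype=torch.bfloat16)
model.eval()

def answer_perplexity(question: str, steps: list[str],
                      answer: str) -> float:
    """Perplexity of the final answer conditioned on the question and
    the given reasoning steps (lower = higher answer confidence)."""
    context = question + "\n" + "\n".join(steps) + "\n"
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    ans_ids = tokenizer(answer, return_tensors="pt",
                        add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    # Score only the answer tokens; mask context tokens out of the loss.
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL on answer
    return torch.exp(loss).item()

def pir_scores(question: str, steps: list[str],
               answer: str) -> list[float]:
    """Leave-one-out importance: how much the answer perplexity rises
    when each step is ablated. Low scores mark prunable steps."""
    base = answer_perplexity(question, steps, answer)
    scores = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]
        scores.append(answer_perplexity(question, ablated, answer) - base)
    return scores
\end{verbatim}
Under this sketch, steps whose score falls below a chosen threshold (often verification or re-checking steps) would be dropped from the training trace, while steps whose removal sharply raises answer perplexity are retained as part of the progressive reasoning path.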