Small reasoning models (SRMs) often overthink during tool use: they reach a correct tool-argument configuration, then continue reasoning and overwrite it with an incorrect final call. We diagnose overthinking via oracle rollouts that inject </think> at sentence boundaries. On the Berkeley Function Calling Leaderboard (BFCL), this oracle termination lifts average accuracy from 85.8\% to 94.2\% while reducing tokens by 80--94\%, revealing substantial recoverable headroom and potentially redundant reasoning. While prior work on concise reasoning has largely targeted mathematics, tool reasoning remains underexplored. We adapt several early-termination baselines to tool use and introduce ThinkBrake, a training-free decoding heuristic. ThinkBrake monitors the log-probability margin between </think> and the current top token at sentence boundaries and triggers termination when this margin becomes small. Across BFCL's single-turn non-live and live splits, ThinkBrake preserves or improves accuracy while reducing tokens by up to 25\%, outperforming these baselines.
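The margin check at the core of ThinkBrake can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the threshold value, the `logprobs` dictionary interface, and the `should_brake` helper name are all assumptions for exposition.

```python
# Hypothetical sketch of a ThinkBrake-style margin check.
# The threshold and the logprobs interface are illustrative assumptions.
THINK_END = "</think>"

def should_brake(logprobs: dict[str, float], margin_threshold: float = 1.0) -> bool:
    """At a sentence boundary, decide whether to inject </think>.

    `logprobs` maps candidate next tokens to their log-probabilities.
    We compare the current top token against </think>: when the gap
    (margin) is small, the model is nearly ready to stop, so we brake.
    """
    if THINK_END not in logprobs:
        return False
    top_token, top_lp = max(logprobs.items(), key=lambda kv: kv[1])
    if top_token == THINK_END:
        return True  # </think> is already the most likely continuation
    margin = top_lp - logprobs[THINK_END]
    return margin < margin_threshold

# Example: </think> trails the top token by only 0.4 nats, so we brake.
boundary_logprobs = {"Therefore": -0.5, THINK_END: -0.9, "However": -2.3}
```

Because the check only reads next-token log-probabilities at sentence boundaries, it requires no fine-tuning and can wrap any decoder that exposes its logits.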