Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly on tasks that require multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect a model's actual reasoning process: models may produce coherent yet misleading justifications, or change their answers in response to external cues without acknowledging those cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, since a model can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate how well two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), improve CoT faithfulness. Our experiments show that GRPO outperforms DPO on larger models, with Qwen2.5-14B-Instruct attaining the best results across all evaluation metrics. For both approaches, performance correlates positively with model size, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
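For reference, the two objectives compared in this work can be sketched in their standard forms, following the original DPO and GRPO formulations; the task-specific reward and preference-pair construction used for CoT faithfulness are not reproduced here, so this should be read as background rather than as the exact training setup. DPO fine-tunes the policy $\pi_\theta$ directly on preference pairs $(x, y_w, y_l)$ against a frozen reference policy $\pi_{\mathrm{ref}}$:
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
\]
where $\sigma$ is the logistic function and $\beta$ controls the strength of the implicit KL constraint. GRPO dispenses with the learned value function of PPO: for each prompt it samples a group of $G$ responses, scores them with rewards $r_1, \dots, r_G$, and uses the group-normalized advantage
\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
\]
inside a PPO-style clipped objective with a KL penalty toward the reference policy.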