Although Large Language Models (LLMs) show exceptional fluency, efforts persist to extract stronger reasoning capabilities from them. Drawing on search-based interpretations of LLM computation, this paper advances a systematic framework for understanding LLM reasoning and optimization: enhancing reasoning is best achieved by structuring a multi-agent pipeline to traverse the search space in a gradual, incremental, and sequential (GIS) manner. Stated succinctly, high-quality reasoning is a controlled, incremental search. To test this framework, we investigate the efficacy of recursive refinement (RR), an iterative process of self-criticism, adversarial stress-testing, and integration of critical feedback, as a practical method for implementing GIS search. We designed an experiment comparing a simple, linear pipeline against a complex, explicitly structured pipeline that adds a recursive refinement layer. The multi-agent models were constructed to reflect the historical personas of three US Founding Fathers (Hamilton, Jefferson, and Madison) using RAG-powered corpora and were prompted to generate responses to three contemporary political issues. Model performance was evaluated using a two-tiered approach: a quantitative score from an LLM arbiter agent and qualitative human judgment. Our results show that the complex model consistently outperformed the simple model across all nine test cases, with an average arbiter score of 88.3 versus 71.7, and that its arguments were superior in analytical depth, structural nuance, and strategic framing. We conclude that recursive refinement is a robust architectural feature for enhancing LLM reasoning via GIS search.
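To make the recursive refinement loop concrete, the following is a minimal, hypothetical sketch of the draft, self-criticism, adversarial stress-test, and feedback-integration cycle described above. The `llm` callable, the prompt wording, and the fixed round count are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable

def recursive_refinement(llm: Callable[[str], str], question: str,
                         persona_context: str, rounds: int = 3) -> str:
    """Illustrative RR loop; `llm` is any text-in/text-out model wrapper."""
    # Initial draft, analogous to the simple linear pipeline's single pass.
    draft = llm(f"Using the persona sources below, answer the question.\n"
                f"Sources:\n{persona_context}\n\nQuestion: {question}")
    for _ in range(rounds):
        # Self-criticism: the model critiques its own draft.
        critique = llm(f"Critique this answer for weak reasoning, missing evidence, "
                       f"and inconsistency with the persona:\n{draft}")
        # Adversarial stress-testing: raise the strongest objections to the draft.
        attack = llm(f"Act as an adversary. State the strongest objections to:\n{draft}")
        # Integrate the critical feedback into a revised draft.
        draft = llm(f"Revise the answer to address the critique and objections.\n"
                    f"Answer:\n{draft}\n\nCritique:\n{critique}\n\nObjections:\n{attack}")
    return draft
```

Under this reading, each round is one small, controlled step through the search space, which is how the GIS traversal would be realized in practice.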