OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

Large Language Models (LLMs) have shown remarkable reasoning capabilities in mathematical and scientific tasks. To enhance complex reasoning, multi-agent systems have been proposed to harness the collective intelligence of LLM agents. However, existing collaboration structures are either predefined or rely on majority voting or round-table debates, which can suppress correct but less dominant agent contributions. Recent approaches model multi-agent systems as graph networks but optimize purely for agent performance, neglecting the quality of interactions. We hypothesize that effective agent communication is crucial for multi-agent reasoning and that debate quality plays a significant role. To address this, we propose OPTAGENT, a multi-agent verbal reinforcement learning algorithm that dynamically constructs and refines multi-agent collaboration structures. Our method defines action spaces and a feedback mechanism that evaluates communication robustness and coherence throughout the debate. The final decision is reached through a majority vote over all agents. We evaluate OPTAGENT on a range of reasoning tasks, including mathematical reasoning, creative writing, scientific reasoning, and numerical sorting. Results demonstrate that our approach significantly outperforms single-agent prompting methods and state-of-the-art multi-agent frameworks across these tasks.

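The final aggregation step, a majority vote over all agents, can be illustrated with a minimal sketch. The abstract does not say how answers are normalized or how ties are broken, so the function name `majority_vote`, the exact-string comparison, and the first-seen tie-break below are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def majority_vote(answers):
    """Return the answer given by the most agents.

    Assumptions (not specified in the abstract): answers are compared as
    exact strings, and ties go to the answer that appeared first.
    """
    if not answers:
        raise ValueError("need at least one agent answer")
    counts = Counter(answers)
    # most_common orders equal counts by first occurrence, which gives
    # the first-seen tie-break described above.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical final answers from three agents after the debate.
    print(majority_vote(["42", "41", "42"]))
```

In practice the answers would likely need task-specific normalization (for example, parsing numeric results) before voting; that step is left out here.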