Who Gets the Reward, Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents

Large Language Models (LLMs) in multi-agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system-level evaluation with agent-level and message-level learning. We propose a theoretical framework that unifies cooperative game-theoretic attribution with process reward modeling to transform system evaluation into agent credit and then into response-level signals. Unlike prior approaches that rely only on attribution (e.g., Shapley) or step-level labels (e.g., PRM), our method produces local, signed, and credit-conserving signals. In success cases, Shapley-based credit assignment fairly allocates outcomes across agents and is refined into per-message rewards that promote cooperation while discouraging redundancy or sabotage. In failure cases, first-error localization yields repair-aware preferences that penalize harmful steps while rewarding corrective attempts. The resulting signals are bounded, cooperative, and directly compatible with reinforcement-based or preference-based post-training, providing a unified and auditable pathway from global evaluation to local supervision in LLM multi-agent training. Our contribution is conceptual: we present a theoretical foundation and training signals, leaving empirical validation for future work.
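To make the credit-conserving property concrete, the following is a minimal sketch of exact Shapley attribution over a small team of agents. It is an illustration under stated assumptions, not the procedure of the framework itself: the agent names (`planner`, `coder`, `critic`), the coalition scores in `SCORES`, and the helper `shapley_values` are all hypothetical, and the abstract does not specify how the system-level value of a partial coalition would be obtained in practice.

```python
from itertools import permutations


def shapley_values(agents, value):
    """Exact Shapley values: average each agent's marginal contribution
    over every ordering in which the agents could join the coalition."""
    credit = {agent: 0.0 for agent in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for agent in order:
            with_agent = coalition | {agent}
            # Marginal contribution of `agent` to the coalition built so far.
            credit[agent] += value(with_agent) - value(coalition)
            coalition = with_agent
    return {agent: total / len(orderings) for agent, total in credit.items()}


# Hypothetical system-level evaluation scores for every coalition of agents.
# These numbers are invented for illustration only.
SCORES = {
    frozenset(): 0.0,
    frozenset({"planner"}): 0.2,
    frozenset({"coder"}): 0.3,
    frozenset({"critic"}): 0.0,
    frozenset({"planner", "coder"}): 0.8,
    frozenset({"planner", "critic"}): 0.2,
    frozenset({"coder", "critic"}): 0.5,
    frozenset({"planner", "coder", "critic"}): 1.0,
}

credits = shapley_values(["planner", "coder", "critic"], SCORES.__getitem__)

# Efficiency (credit conservation): the per-agent credits sum to the
# score of the full team, up to floating-point error.
total_credit = sum(credits.values())
```

In this toy setting the per-agent credits add up to the full-team score of 1.0, which is the sense in which Shapley attribution conserves credit. Exact enumeration grows factorially with the number of agents, so a larger system would likely need a sampling-based approximation.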
