Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential to multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven, physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks that decouple proprioception, exteroception, and action, mitigating cross-modal interference while enabling synergistic coordination. We further propose a novel interaction-graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate learning. Within this representation, we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines and achieves state-of-the-art performance, producing coherent, physically plausible, and semantically faithful multi-agent behaviors from text prompts alone. Our code and data will be released to facilitate future research.
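To make the sparse edge-based attention over the interaction graph concrete, the snippet below is a minimal, illustrative sketch and not the paper's actual implementation: it assumes per-joint features for two agents, scores every joint-to-joint edge between them, and keeps only the strongest `top_k` edges per query joint before aggregating. All names (`InteractionEdgeAttention`, `top_k`, dimensions) are hypothetical.

```python
# Hypothetical sketch of sparse edge-based attention over an inter-agent
# interaction graph; the real InterAgent architecture is not public, so all
# module names, dimensions, and the top-k pruning rule here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InteractionEdgeAttention(nn.Module):
    """Score every joint-to-joint edge between two agents and prune to top-k."""

    def __init__(self, joint_dim: int, hidden_dim: int = 64, top_k: int = 8):
        super().__init__()
        self.query = nn.Linear(joint_dim, hidden_dim)  # embeds agent-A joints
        self.key = nn.Linear(joint_dim, hidden_dim)    # embeds agent-B joints
        self.value = nn.Linear(joint_dim, hidden_dim)
        self.top_k = top_k
        self.scale = hidden_dim ** -0.5

    def forward(self, joints_a: torch.Tensor, joints_b: torch.Tensor) -> torch.Tensor:
        # joints_a, joints_b: (B, J, D) per-joint features of the two humanoids.
        q, k, v = self.query(joints_a), self.key(joints_b), self.value(joints_b)
        scores = torch.einsum("bid,bjd->bij", q, k) * self.scale  # (B, J, J) edge scores

        # Sparsify: keep only the top-k edges per query joint, mask the rest.
        topk_idx = scores.topk(self.top_k, dim=-1).indices
        mask = torch.full_like(scores, float("-inf")).scatter(-1, topk_idx, 0.0)
        attn = F.softmax(scores + mask, dim=-1)  # attention over surviving edges only

        # Aggregate agent-B joint features into agent-A's per-joint exteroception.
        return torch.einsum("bij,bjd->bid", attn, v)


if __name__ == "__main__":
    B, J, D = 2, 24, 16  # batch, joints per humanoid, per-joint feature dim (assumed)
    layer = InteractionEdgeAttention(joint_dim=D)
    extero = layer(torch.randn(B, J, D), torch.randn(B, J, D))
    print(extero.shape)  # torch.Size([2, 24, 64])
```

Under these assumptions, the top-k masking is what "dynamically prunes redundant connections": only the highest-scoring inter-agent joint pairs contribute to each joint's exteroception feature.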