Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential to multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven, physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks that decouple proprioception, exteroception, and action, mitigating cross-modal interference while enabling synergistic coordination. We further propose a novel interaction-graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate learning. Within this representation, we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines and achieves state-of-the-art performance, producing coherent, physically plausible, and semantically faithful multi-agent behaviors from text prompts alone. Our code and data will be released to facilitate future research.
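To make the sparse edge-based attention over the interaction graph concrete, the snippet below is a minimal, illustrative sketch and not the paper's actual implementation: it assumes per-joint features for two agents, scores every joint-to-joint edge between them, and keeps only the strongest `top_k` edges per query joint before aggregating. All names (`InteractionEdgeAttention`, `top_k`, dimensions) are hypothetical.

```python
# Hypothetical sketch of sparse edge-based attention over an inter-agent
# interaction graph; the real InterAgent architecture is not public, so all
# module names, dimensions, and the top-k pruning rule here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InteractionEdgeAttention(nn.Module):
    """Score every joint-to-joint edge between two agents and prune to top-k."""

    def __init__(self, joint_dim: int, hidden_dim: int = 64, top_k: int = 8):
        super().__init__()
        self.query = nn.Linear(joint_dim, hidden_dim)  # embeds agent-A joints
        self.key = nn.Linear(joint_dim, hidden_dim)    # embeds agent-B joints
        self.value = nn.Linear(joint_dim, hidden_dim)
        self.top_k = top_k
        self.scale = hidden_dim ** -0.5

    def forward(self, joints_a: torch.Tensor, joints_b: torch.Tensor) -> torch.Tensor:
        # joints_a, joints_b: (B, J, D) per-joint features of the two humanoids.
        q, k, v = self.query(joints_a), self.key(joints_b), self.value(joints_b)
        scores = torch.einsum("bid,bjd->bij", q, k) * self.scale  # (B, J, J) edge scores

        # Sparsify: keep only the top-k edges per query joint, mask the rest.
        topk_idx = scores.topk(self.top_k, dim=-1).indices
        mask = torch.full_like(scores, float("-inf")).scatter(-1, topk_idx, 0.0)
        attn = F.softmax(scores + mask, dim=-1)  # attention over surviving edges only

        # Aggregate agent-B joint features into agent-A's per-joint exteroception.
        return torch.einsum("bij,bjd->bid", attn, v)


if __name__ == "__main__":
    B, J, D = 2, 24, 16  # batch, joints per humanoid, per-joint feature dim (assumed)
    layer = InteractionEdgeAttention(joint_dim=D)
    extero = layer(torch.randn(B, J, D), torch.randn(B, J, D))
    print(extero.shape)  # torch.Size([2, 24, 64])
```

Under these assumptions, the top-k masking is what "dynamically prunes redundant connections": only the highest-scoring inter-agent joint pairs contribute to each joint's exteroception feature.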