Humanoid agents are expected to emulate the complex coordination inherent in human social behaviors. However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential to multi-agent interaction. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven, physics-based multi-agent humanoid control. At its core, we introduce an autoregressive diffusion transformer equipped with multi-stream blocks, which decouples proprioception, exteroception, and action to mitigate cross-modal interference while enabling synergistic coordination. We further propose a novel interaction-graph exteroception representation that explicitly captures fine-grained joint-to-joint spatial dependencies to facilitate network learning. Within this representation, we devise a sparse edge-based attention mechanism that dynamically prunes redundant connections and emphasizes critical inter-agent spatial relations, thereby enhancing the robustness of interaction modeling. Extensive experiments demonstrate that InterAgent consistently outperforms multiple strong baselines, achieving state-of-the-art performance and producing coherent, physically plausible, and semantically faithful multi-agent behaviors from text prompts alone. Our code and data will be released to facilitate future research.
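To make the sparse edge-based attention concrete, below is a minimal PyTorch-style sketch of attention over inter-agent joint-to-joint edges with top-k pruning. All module names, tensor shapes, and the top-k pruning rule are illustrative assumptions for exposition, not the released implementation.

```python
# Hypothetical sketch: sparse edge-based attention over an inter-agent
# joint-to-joint interaction graph (shapes and names are assumptions).
import torch
import torch.nn as nn


class SparseEdgeAttention(nn.Module):
    def __init__(self, dim: int = 64, top_k: int = 16):
        super().__init__()
        self.top_k = top_k                      # edges kept per query joint
        self.edge_mlp = nn.Sequential(          # embeds relative joint offsets (graph edges)
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, joints_a: torch.Tensor, joints_b: torch.Tensor) -> torch.Tensor:
        # joints_a: (B, J, 3) ego-agent joint positions
        # joints_b: (B, J, 3) other-agent joint positions
        rel = joints_b.unsqueeze(1) - joints_a.unsqueeze(2)   # (B, J, J, 3) edge offsets
        e = self.edge_mlp(rel)                                # (B, J, J, D) edge features
        q = self.q_proj(e.mean(dim=2))                        # (B, J, D) per-joint query
        k, v = self.k_proj(e), self.v_proj(e)                 # (B, J, J, D)

        scores = (q.unsqueeze(2) * k).sum(-1) / k.shape[-1] ** 0.5   # (B, J, J)
        # Sparsify: keep only the top-k most relevant edges per query joint,
        # masking the rest so pruned connections receive ~0 attention weight.
        topk = scores.topk(min(self.top_k, scores.shape[-1]), dim=-1)
        masked = torch.full_like(scores, float("-inf")).scatter(-1, topk.indices, topk.values)
        attn = masked.softmax(dim=-1)                         # (B, J, J)
        return (attn.unsqueeze(-1) * v).sum(dim=2)            # (B, J, D) interaction feature
```

In this sketch the returned per-joint interaction feature would feed the exteroception stream of the multi-stream blocks; how the paper fuses it with proprioception and action streams is not shown here.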