Human motion generation has shown great advances thanks to recent diffusion models trained on large-scale motion capture data. Most existing works, however, target the animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle to generating versatile, high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate that InterPose brings significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.