Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeletons and object detections) to mitigate fragmentation. Our framework makes three key contributions. First, a sinusoidal encoding strategy maps 3D skeleton coordinates into a continuous sin-cos space, improving the robustness of the spatial representation. Second, a temporal graph fusion module aligns multi-modal inputs of differing temporal resolutions through hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, show that our approach outperforms state-of-the-art methods in action segmentation accuracy, achieving F1@10 = 94.5% and F1@25 = 92.8%.
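To make the first contribution concrete, the sketch below illustrates one plausible form of the sinusoidal encoding: a Fourier-feature-style mapping of raw 3D joint coordinates into a continuous sin-cos space. The number of frequency bands and the choice of powers of two are assumptions for illustration; the abstract does not specify these details.

```python
import numpy as np

def sinusoidal_encode(joints_xyz, num_freqs=4):
    """Sketch: map 3D joint coordinates into a continuous sin-cos feature space.

    joints_xyz: array of shape (T, J, 3) -- T frames, J joints, xyz coordinates.
    Returns an array of shape (T, J, 3 * 2 * num_freqs).
    Frequency-band choice (powers of two) is an assumption, not the paper's setting.
    """
    freqs = 2.0 ** np.arange(num_freqs)              # assumed frequency bands
    scaled = joints_xyz[..., None] * freqs           # (T, J, 3, num_freqs)
    encoded = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return encoded.reshape(*joints_xyz.shape[:-1], -1)
```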
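Likewise, a minimal sketch of a SmoothLabelMix-style augmentation is given below, assuming a sigmoid ramp that blends two training sequences (and their one-hot labels) over time so that the synthetic sample contains a gradual transition rather than a hard cut. The transition point, steepness, and blending function are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def smooth_label_mix(x_a, y_a, x_b, y_b, center=None, steepness=0.2):
    """Sketch of a SmoothLabelMix-style augmentation (details assumed).

    x_a, x_b: feature sequences of shape (T, D); y_a, y_b: one-hot labels of shape (T, C).
    A time-varying weight lam_t blends sequence A into sequence B, and the same
    weights are applied to the labels, yielding soft labels around the transition.
    """
    T = x_a.shape[0]
    if center is None:
        center = np.random.randint(T // 4, 3 * T // 4)      # assumed transition point
    t = np.arange(T)
    lam = 1.0 / (1.0 + np.exp(steepness * (t - center)))    # ramps smoothly from 1 to 0
    x_mix = lam[:, None] * x_a + (1.0 - lam[:, None]) * x_b
    y_mix = lam[:, None] * y_a + (1.0 - lam[:, None]) * y_b
    return x_mix, y_mix
```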