Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeletons and object detections) to mitigate fragmentation. Our framework makes three key contributions. First, a sinusoidal encoding strategy maps 3D skeleton coordinates into a continuous sin-cos space, improving the robustness of the spatial representation. Second, a temporal graph fusion module aligns multi-modal inputs of differing temporal resolutions through hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, show that our approach outperforms state-of-the-art methods in action segmentation accuracy, achieving F1@10 = 94.5% and F1@25 = 92.8%.
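To make the first contribution concrete, the sketch below illustrates one plausible form of the sinusoidal encoding: a Fourier-feature-style mapping of raw 3D joint coordinates into a continuous sin-cos space. The number of frequency bands and the choice of powers of two are assumptions for illustration; the abstract does not specify these details.

```python
import numpy as np

def sinusoidal_encode(joints_xyz, num_freqs=4):
    """Sketch: map 3D joint coordinates into a continuous sin-cos feature space.

    joints_xyz: array of shape (T, J, 3) -- T frames, J joints, xyz coordinates.
    Returns an array of shape (T, J, 3 * 2 * num_freqs).
    Frequency-band choice (powers of two) is an assumption, not the paper's setting.
    """
    freqs = 2.0 ** np.arange(num_freqs)              # assumed frequency bands
    scaled = joints_xyz[..., None] * freqs           # (T, J, 3, num_freqs)
    encoded = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return encoded.reshape(*joints_xyz.shape[:-1], -1)
```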
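Likewise, a minimal sketch of a SmoothLabelMix-style augmentation is given below, assuming a sigmoid ramp that blends two training sequences (and their one-hot labels) over time so that the synthetic sample contains a gradual transition rather than a hard cut. The transition point, steepness, and blending function are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def smooth_label_mix(x_a, y_a, x_b, y_b, center=None, steepness=0.2):
    """Sketch of a SmoothLabelMix-style augmentation (details assumed).

    x_a, x_b: feature sequences of shape (T, D); y_a, y_b: one-hot labels of shape (T, C).
    A time-varying weight lam_t blends sequence A into sequence B, and the same
    weights are applied to the labels, yielding soft labels around the transition.
    """
    T = x_a.shape[0]
    if center is None:
        center = np.random.randint(T // 4, 3 * T // 4)      # assumed transition point
    t = np.arange(T)
    lam = 1.0 / (1.0 + np.exp(steepness * (t - center)))    # ramps smoothly from 1 to 0
    x_mix = lam[:, None] * x_a + (1.0 - lam[:, None]) * x_b
    y_mix = lam[:, None] * y_a + (1.0 - lam[:, None]) * y_b
    return x_mix, y_mix
```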