Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress on exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask.