Comprehensively interpreting human behavior is a core challenge in human-aware artificial intelligence. However, prior work has typically focused on body behavior, neglecting the crucial role of eye gaze and its synergy with body motion. We present GazeInterpreter, a novel large language model (LLM)-based approach that parses eye gaze data to generate eye-body-coordinated narrations. Specifically, our method features 1) a symbolic gaze parser that translates raw gaze signals into symbolic gaze events; 2) a hierarchical structure that first uses an LLM to generate eye gaze narration at the semantic level and then integrates gaze with body motion within the same observation window to produce an integrated narration; and 3) a self-correcting loop that iteratively refines the modality match, temporal coherence, and completeness of the integrated narration. This hierarchical and iterative processing effectively aligns physical signals and semantic text in both the temporal and spatial domains. We validate the effectiveness of our eye-body-coordinated narrations on the text-driven motion generation task in the large-scale Nymeria benchmark. Moreover, we report significant performance improvements on the example downstream tasks of action anticipation and behavior summarization. Taken together, these results reveal the significant potential of parsing eye gaze to interpret human behavior and open up a new direction for human behavior understanding.
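To make the symbolic gaze parser concrete, the sketch below shows one plausible way to convert raw gaze signals into symbolic gaze events. It is a minimal illustration only, assuming a simple velocity-threshold (I-VT) rule over 2D gaze angles; the function name, threshold value, and input format are hypothetical and not taken from the paper, which does not specify the parser's internals in the abstract.

```python
import numpy as np

def parse_gaze_events(timestamps, gaze_xy, saccade_velocity_deg_s=30.0):
    """Segment raw gaze samples into symbolic "fixation"/"saccade" events.

    Hypothetical illustration (velocity-threshold classification), not the
    paper's actual parser.

    timestamps : (N,) sample times in seconds
    gaze_xy    : (N, 2) gaze directions in degrees of visual angle
    Returns a list of (label, start_time, end_time) tuples.
    """
    # Per-sample angular velocity between consecutive gaze samples.
    dt = np.maximum(np.diff(timestamps), 1e-6)
    velocity = np.linalg.norm(np.diff(gaze_xy, axis=0), axis=1) / dt

    # Label each interval by comparing its velocity against the threshold.
    labels = np.where(velocity > saccade_velocity_deg_s, "saccade", "fixation")

    # Merge consecutive samples with the same label into symbolic events.
    events, start = [], 0
    for i in range(1, len(labels)):
        if labels[i] != labels[start]:
            events.append((labels[start], timestamps[start], timestamps[i]))
            start = i
    events.append((labels[start], timestamps[start], timestamps[-1]))
    return events
```

Such symbolic events (e.g., fixations on scene objects, saccades between them) could then be serialized as text and passed to the LLM for semantic-level gaze narration, consistent with the hierarchical structure described above.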