The growing adoption of augmented and virtual reality (AR and VR) technologies in industrial training and on-the-job assistance has created new opportunities for intelligent, context-aware support systems. As workers perform complex tasks guided by AR and VR, these devices capture rich streams of multimodal data, including gaze, hand actions, and task progression, that can reveal user intent and task state in real time. However, leveraging this information effectively remains a major challenge. In this work, we present a context-aware large language model (LLM) assistant that integrates diverse data modalities, such as hand actions, task steps, and dialogue history, into a unified framework for real-time question answering. To systematically study how context influences performance, we introduce an incremental prompting framework in which each model version receives progressively richer contextual inputs. Using the HoloAssist dataset, which records AR-guided task executions, we evaluate how each modality contributes to the assistant's effectiveness. Our experiments show that incorporating multimodal context significantly improves the accuracy and relevance of responses. These findings highlight the potential of LLM-driven multimodal integration to enable adaptive, intuitive assistance in AR- and VR-based industrial training and assistance.
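To make the incremental prompting idea concrete, the following Python sketch shows one plausible way progressively richer context could be assembled into prompts for an LLM assistant. The context fields (`hand_actions`, `task_steps`, `dialogue_history`), the level names, and the `build_prompt` helper are illustrative assumptions for this sketch, not the paper's actual implementation or prompt format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskContext:
    """Multimodal context captured from an AR/VR session (illustrative fields)."""
    hand_actions: List[str] = field(default_factory=list)      # e.g. "pick up screwdriver"
    task_steps: List[str] = field(default_factory=list)        # completed / current steps
    dialogue_history: List[str] = field(default_factory=list)  # prior user/assistant turns

# Each "version" of the assistant sees progressively richer context.
CONTEXT_LEVELS = {
    "v0_question_only": [],
    "v1_plus_actions":  ["hand_actions"],
    "v2_plus_steps":    ["hand_actions", "task_steps"],
    "v3_full_context":  ["hand_actions", "task_steps", "dialogue_history"],
}

def build_prompt(question: str, ctx: TaskContext, level: str) -> str:
    """Compose a prompt from the question plus the context fields enabled at this level."""
    sections = [f"User question: {question}"]
    for field_name in CONTEXT_LEVELS[level]:
        values = getattr(ctx, field_name)
        if values:
            sections.append(f"{field_name.replace('_', ' ').title()}:\n" +
                            "\n".join(f"- {v}" for v in values))
    sections.append("Answer concisely, using the context above.")
    return "\n\n".join(sections)

if __name__ == "__main__":
    ctx = TaskContext(
        hand_actions=["grasp torque wrench", "align bolt"],
        task_steps=["Step 3/7: secure the mounting bracket"],
        dialogue_history=["User: which bolt size?", "Assistant: use the M6 bolt."],
    )
    for level in CONTEXT_LEVELS:
        # In the full system, each prompt would be sent to an LLM for an answer;
        # here we only print the progressively richer prompts.
        print(f"=== {level} ===\n{build_prompt('What do I do next?', ctx, level)}\n")
```

Comparing model answers across these levels is one simple way to attribute gains in accuracy and relevance to individual modalities.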