Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving

Given the dynamic nature of autonomous driving environments and their stringent safety requirements, general multimodal large language models (MLLMs) that rely on CLIP alone often struggle to represent driving-specific scenarios accurately, particularly complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: an Affinity hint that emphasizes instance-level structure by strengthening token-wise connections; a Semantic hint that incorporates high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs; and a Question hint that aligns visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, which enriches the visual representation with driving-related features learned from limited domain data and enables faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods on all key metrics.

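To make the data flow described in the abstract concrete, the following is a minimal NumPy sketch of how three hints might be computed and fused. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify any formulas, so the cosine-similarity affinity, the cross-attention used for the Question hint, the attention-based fusion with a residual connection, and every name here (`affinity_hint`, `question_hint`, `fuse_hints`, `temperature`) are hypothetical. The Semantic hint is assumed to arrive as precomputed tokens from some driving-specific encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def affinity_hint(visual, temperature=0.1):
    """Hypothetical Affinity hint: mix each token with tokens similar to it.

    visual: (N, D) visual tokens. Returns (N, D).
    """
    unit = visual / (np.linalg.norm(visual, axis=-1, keepdims=True) + 1e-8)
    affinity = softmax(unit @ unit.T / temperature)  # (N, N), rows sum to 1
    return affinity @ visual


def question_hint(visual, question):
    """Hypothetical Question hint: visual tokens attend over question tokens.

    visual: (N, D), question: (T, D). Returns (N, D).
    """
    scale = np.sqrt(visual.shape[-1])
    attn = softmax(visual @ question.T / scale)  # (N, T)
    return attn @ question


def fuse_hints(visual, semantic, question):
    """Hypothetical fusion: visual tokens attend over all hint tokens.

    semantic: (M, D) tokens assumed to come from a driving-specific encoder.
    Returns (N, D): the visual tokens plus an attention-weighted hint summary.
    """
    hints = np.concatenate(
        [affinity_hint(visual), semantic, question_hint(visual, question)],
        axis=0,
    )  # (2N + M, D)
    scale = np.sqrt(visual.shape[-1])
    attn = softmax(visual @ hints.T / scale)  # (N, 2N + M)
    return visual + attn @ hints  # residual connection (an assumption)


# Usage with random stand-in features (not real model outputs).
rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 8))    # 6 visual tokens, dimension 8
semantic = rng.normal(size=(4, 8))  # 4 semantic-hint tokens
question = rng.normal(size=(3, 8))  # 3 question tokens
fused = fuse_hints(visual, semantic, question)
print(fused.shape)
```

The residual form keeps the original visual tokens intact and adds hint information on top, which is one common way to enrich a representation without replacing it; the actual Hint Fusion module may differ.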