Autonomous driving environments are dynamic and their safety requirements are stringent, so general MLLMs paired with CLIP alone often struggle to represent driving-specific scenarios accurately, particularly complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: an Affinity hint that emphasizes instance-level structure by strengthening token-wise connections; a Semantic hint that incorporates high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs; and a Question hint that aligns visual features with the query context by focusing on question-relevant regions. A Hint Fusion module fuses these hints to enrich the visual representation, capturing driving-related features from limited domain data and enabling faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, which significantly outperforms previous state-of-the-art methods on all key metrics.
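To make the three-hint idea concrete, the following is a minimal NumPy sketch of how such hints could be computed and fused. It is an illustration only: the abstract does not specify the actual operators, so every function name, the attention-style formulas, and the gated residual sum used for fusion are assumptions rather than the HoP implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def affinity_hint(visual):
    """Assumed form: mix tokens using row-normalized cosine similarity.

    visual: (N, D) array of visual tokens.
    """
    normed = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    affinity = softmax(normed @ normed.T, axis=1)  # (N, N) token-wise links
    return affinity @ visual


def semantic_hint(visual, semantic_tokens):
    """Assumed form: cross-attention from visual tokens to semantic tokens.

    semantic_tokens: (M, D) array, e.g. embeddings of driving concepts.
    """
    scale = np.sqrt(visual.shape[1])
    attn = softmax(visual @ semantic_tokens.T / scale, axis=1)  # (N, M)
    return attn @ semantic_tokens


def question_hint(visual, question):
    """Assumed form: reweight tokens by similarity to the question embedding.

    question: (D,) array representing the query.
    """
    scale = np.sqrt(visual.shape[1])
    weights = softmax(visual @ question / scale, axis=0)  # (N,), sums to 1
    return weights[:, None] * visual


def enrich_visual_tokens(visual, semantic_tokens, question,
                         gates=(1.0, 1.0, 1.0)):
    """Hypothetical fusion: gated residual sum of the three hints."""
    hints = (
        affinity_hint(visual),
        semantic_hint(visual, semantic_tokens),
        question_hint(visual, question),
    )
    fused = visual.copy()
    for gate, hint in zip(gates, hints):
        fused = fused + gate * hint
    return fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(6, 8))     # 6 visual tokens, 8 dims
    semantic = rng.normal(size=(3, 8))   # 3 semantic tokens
    question = rng.normal(size=8)        # one question embedding
    print(enrich_visual_tokens(visual, semantic, question).shape)
```

In this sketch the gates are fixed scalars; a learned fusion module would presumably replace them with trainable parameters, and setting all gates to zero recovers the original visual tokens.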