Conventional end-to-end (E2E) driving models are effective at generating physically plausible trajectories, but they often fail to generalize to long-tail scenarios because they lack the world knowledge needed to understand and reason about their surroundings. In contrast, Vision-Language-Action (VLA) models leverage world knowledge to handle challenging cases, but their limited 3D reasoning capability can lead to physically infeasible actions. In this work, we introduce DiffVLA++, an enhanced autonomous driving framework that explicitly bridges cognitive reasoning and E2E planning through metric-guided alignment. First, we build a VLA module that directly generates semantically grounded driving trajectories. Second, we design an E2E module with a dense trajectory vocabulary that ensures physical feasibility. Third, and most critically, we introduce a metric-guided trajectory scorer that guides and aligns the outputs of the VLA and E2E modules, thereby integrating their complementary strengths. Experiments on the ICCV 2025 Autonomous Grand Challenge leaderboard show that DiffVLA++ achieves an EPDMS of 49.12.
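The core idea of the metric-guided scorer can be illustrated with a minimal sketch: candidate trajectories from both modules are scored by a weighted combination of driving metrics, and the highest-scoring candidate is selected. The specific metrics (`progress`, `smoothness`) and weights below are illustrative assumptions, not the paper's actual scoring terms.

```python
import numpy as np

def metric_guided_score(trajectories, metric_fns, weights):
    """Score each candidate trajectory as a weighted sum of metric values.

    trajectories: list of (T, 2) arrays of future (x, y) waypoints.
    metric_fns:   callables mapping one trajectory to a scalar (higher = better).
    weights:      one non-negative weight per metric (assumed, for illustration).
    """
    scores = np.zeros(len(trajectories))
    for w, fn in zip(weights, metric_fns):
        scores += w * np.array([fn(t) for t in trajectories])
    return scores

# Toy candidates standing in for the VLA and E2E module outputs.
vla_traj = np.cumsum(np.tile([1.0, 0.0], (8, 1)), axis=0)  # straight ahead
e2e_traj = np.cumsum(np.tile([0.9, 0.1], (8, 1)), axis=0)  # slight lateral drift
candidates = [vla_traj, e2e_traj]

# Hypothetical metrics: longitudinal progress, and penalized lateral jitter.
progress = lambda t: t[-1, 0]
smoothness = lambda t: -np.abs(np.diff(t[:, 1])).sum()

scores = metric_guided_score(candidates, [progress, smoothness], [1.0, 2.0])
best = candidates[int(np.argmax(scores))]
```

In the full framework, such a scorer serves as the shared yardstick that aligns the semantically grounded VLA proposals with the physically feasible E2E vocabulary, rather than trusting either module alone.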