Trajectory planning is a fundamental yet challenging component of autonomous driving. End-to-end planners frequently falter under adverse weather, unpredictable human behavior, or complex road layouts, primarily because they lack strong generalization or few-shot capabilities beyond their training data. We propose LLaViDA, a Large Language Vision Driving Assistant that leverages a Vision-Language Model (VLM) for object motion prediction, semantic grounding, and chain-of-thought reasoning in trajectory planning for autonomous driving. A two-stage training pipeline, supervised fine-tuning followed by Trajectory Preference Optimization (TPO), enhances scene understanding and trajectory planning by injecting regression-based supervision, producing a powerful "VLM Trajectory Planner for Autonomous Driving." On the NuScenes benchmark, LLaViDA surpasses state-of-the-art end-to-end and other recent VLM/LLM-based baselines on the open-loop trajectory planning task, achieving an average L2 trajectory error of 0.31 m and a collision rate of 0.10% on the NuScenes test set. The code for this paper is available on GitHub.
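The abstract does not spell out the TPO objective, so the following is only a minimal sketch of how a preference-optimization loss could be combined with regression-based supervision on predicted waypoints, in the spirit described above. The function name `tpo_loss`, the DPO-style preference term, and the weights `beta` and `lambda_reg` are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def tpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected,
             pred_waypoints, gt_waypoints,
             beta=0.1, lambda_reg=1.0):
    """Hypothetical TPO-style objective (sketch, not the paper's formulation):
    a preference term over chosen vs. rejected trajectory completions,
    plus an L2 regression term on the waypoints decoded from the model output."""
    # Preference term: log-sigmoid of the scaled log-ratio margin
    # between chosen and rejected completions (DPO-style).
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    pref = -F.logsigmoid(beta * margin).mean()

    # Regression term: mean L2 distance between predicted and ground-truth
    # future waypoints, shape [batch, horizon, 2] in BEV meters.
    reg = torch.linalg.norm(pred_waypoints - gt_waypoints, dim=-1).mean()

    return pref + lambda_reg * reg
```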
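For context on the reported numbers, open-loop planning on NuScenes is conventionally scored by the average L2 distance between predicted and ground-truth ego waypoints at 1 s, 2 s, and 3 s horizons, alongside a collision rate. The sketch below shows one common convention (averaging over all steps up to each horizon, assuming 2 Hz waypoints); the paper's exact evaluation protocol may differ, and the function name and defaults here are assumptions.

```python
import numpy as np

def average_l2_error(pred, gt, hz=2, horizons_s=(1, 2, 3)):
    """Average L2 waypoint error at fixed horizons (assumed 2 Hz sampling).

    pred, gt: arrays of shape [num_frames, num_steps, 2] holding BEV (x, y)
    waypoints in meters. Returns per-horizon averages and their mean.
    """
    per_step = np.linalg.norm(pred - gt, axis=-1)            # [frames, steps]
    scores = {f"L2@{h}s": float(per_step[:, : h * hz].mean()) for h in horizons_s}
    scores["L2 avg"] = float(np.mean(list(scores.values())))
    return scores
```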