Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.