End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and in handling long-tail events. Reinforcement Learning (RL) offers a path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep-seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to ``dream'' of the outcomes of candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, indicating that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.
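The critic-query step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Action` and `Forecast` types, the `toy_world_model` stand-in, the `risk_weight` parameter, and the scoring rule are all hypothetical and exist only to show how an agent might ask a world model to forecast the outcome of each candidate action and penalize forecasts of collisions or off-road events.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    steer: float  # hypothetical steering command in [-1, 1]
    accel: float  # hypothetical acceleration command in [-1, 1]


@dataclass(frozen=True)
class Forecast:
    collision_prob: float  # predicted probability of a collision
    offroad_prob: float    # predicted probability of leaving the road
    progress: float        # predicted distance travelled, in meters


# Hypothetical interface: (state, action) -> forecast.
WorldModel = Callable[[float, Action], Forecast]


def toy_world_model(gap_m: float, action: Action) -> Forecast:
    """Stand-in for a learned world model (assumption, for illustration only).

    The state is just the gap to an obstacle ahead, in meters.
    """
    travel = 5.0 + 5.0 * action.accel           # meters covered over the horizon
    collision = 1.0 if travel >= gap_m else 0.0  # reaching the obstacle means a crash
    offroad = min(1.0, abs(action.steer))        # hard steering leaves the road
    return Forecast(collision, offroad, travel)


def dream_scores(
    state: float,
    candidates: Sequence[Action],
    world_model: WorldModel,
    risk_weight: float = 10.0,
) -> List[float]:
    """Score each candidate by forecast progress minus a weighted risk penalty."""
    scores = []
    for action in candidates:
        f = world_model(state, action)
        scores.append(f.progress - risk_weight * (f.collision_prob + f.offroad_prob))
    return scores


def best_action(
    state: float,
    candidates: Sequence[Action],
    world_model: WorldModel,
    risk_weight: float = 10.0,
) -> Action:
    """Return the candidate with the highest critic score."""
    scores = dream_scores(state, candidates, world_model, risk_weight)
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]


CANDIDATES = [Action(0.0, 1.0), Action(0.0, 0.0), Action(0.0, -1.0)]
```

With an obstacle 8 m ahead, the toy critic forecasts a collision for full acceleration and so prefers coasting. With a 50 m gap, full acceleration scores highest. In an RL refinement loop, scores of this kind could serve as a reward or advantage signal rather than as a direct action selector. A critic with an optimistic bias would set `collision_prob` near zero regardless of the action, which would remove the penalty entirely.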