We study how to exploit dense simulator-defined rewards in vision-based autonomous driving without inheriting their misalignment with deployment metrics. In realistic simulators such as CARLA, privileged state (e.g., lane geometry, infractions, time-to-collision) can be converted into dense rewards that stabilize and accelerate model-based reinforcement learning, but policies trained directly on these signals often overfit and fail to generalize when evaluated on sparse objectives such as route completion and collision-free overtaking. We propose reward-privileged world model distillation, a two-stage framework in which a teacher DreamerV3-style agent is first trained with a dense privileged reward, and only its latent dynamics are distilled into a student trained solely on sparse task rewards. Teacher and student share the same observation space (semantic bird's-eye-view images); privileged information enters only through the teacher's reward, and the student does not imitate the teacher's actions or value estimates. Instead, the student's world model is regularized to match the teacher's latent dynamics while its policy is learned from scratch on sparse success/failure signals. In CARLA lane-following and overtaking benchmarks, sparse-reward students outperform both dense-reward teachers and sparse-from-scratch baselines. On unseen lane-following routes, reward-privileged distillation improves success by about 23 percent relative to the dense teacher while maintaining comparable or better safety. On overtaking, students retain near-perfect performance on training routes and achieve up to a 27x improvement in success on unseen routes, with improved lane keeping. These results show that dense rewards can be leveraged to learn richer dynamics models while keeping the deployed policy optimized strictly for sparse, deployment-aligned objectives.
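For concreteness, the following is a minimal sketch of the student objective implied by this setup, assuming a DreamerV3-style world-model loss plus a KL-based latent-distillation term; the weight $\beta$, the direction of the KL, and the exact teacher quantities being matched are illustrative assumptions rather than details stated above:
\[
\mathcal{L}_{\text{student}}(\phi) \;=\; \mathcal{L}_{\text{WM}}(\phi)
\;+\; \beta\,\mathbb{E}_t\!\left[\,\mathrm{KL}\!\big(\, q_{\bar{\phi}_T}(z_t \mid h_t, o_t) \,\big\|\, q_{\phi}(z_t \mid h_t, o_t) \,\big)\right],
\]
where $\mathcal{L}_{\text{WM}}$ is the student's standard reconstruction-plus-dynamics loss, $q_{\bar{\phi}_T}$ is the frozen teacher's latent posterior over the shared bird's-eye-view observation $o_t$, and $q_{\phi}$ is the student's posterior. Under this sketch, only the world-model parameters $\phi$ receive the distillation signal; the actor and critic are trained in imagination against the sparse success/failure reward alone, consistent with the description above.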