Efficient Learning of Safe Driving Policy via Human-AI Copilot Optimization

Human intervention is an effective way to inject human knowledge into the training loop of reinforcement learning, which can speed up learning and keep training safe. Given the very limited budget for human intervention, it remains challenging to design when and how the human expert interacts with the learning agent during training. In this work, we develop a novel human-in-the-loop learning method called Human-AI Copilot Optimization (HACO). To allow the agent sufficient exploration in risky environments while ensuring training safety, the human expert can take over control and demonstrate how to avoid potentially dangerous situations or trivial behaviors. HACO then uses data from both the trial-and-error exploration and the human's partial demonstrations to train a high-performing agent. HACO extracts proxy state-action values from the partial human demonstrations and optimizes the agent to improve the proxy values while reducing human interventions. The experiments show that HACO achieves high sample efficiency on the safe driving benchmark. HACO can train agents to drive in unseen traffic scenarios with a small human intervention budget and achieve high safety and generalizability, outperforming both reinforcement learning and imitation learning baselines by a large margin. Code and demo videos are available at: https://decisionforce.github.io/HACO/.
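The takeover scheme described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general shape of a shared-control rollout in which a human may override the agent, and every name in it (`Transition`, `collect_episode`, `proxy_value_gap`, the toy lane-keeping dynamics, and the toy `q_fn`) is a hypothetical chosen for illustration. In particular, the "gap" computed here is an assumed stand-in for the proxy state-action values, whose exact definition the abstract does not give.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Transition:
    state: float
    agent_action: float    # action the learning agent proposed
    applied_action: float  # action actually executed in the environment
    intervened: bool       # True if the human took over at this step


def collect_episode(
    agent_policy: Callable[[float], float],
    human_takeover: Callable[[float, float], Optional[float]],
    step_fn: Callable[[float, float], float],
    initial_state: float,
    horizon: int,
) -> List[Transition]:
    """Roll out one episode with shared control.

    human_takeover(state, proposed) returns an override action,
    or None when the human lets the agent act.
    """
    state = initial_state
    buffer: List[Transition] = []
    for _ in range(horizon):
        proposed = agent_policy(state)
        override = human_takeover(state, proposed)
        applied = proposed if override is None else override
        buffer.append(Transition(state, proposed, applied, override is not None))
        state = step_fn(state, applied)
    return buffer


def intervention_rate(buffer: List[Transition]) -> float:
    """Fraction of steps on which the human took over."""
    return sum(t.intervened for t in buffer) / len(buffer) if buffer else 0.0


def proxy_value_gap(q_fn: Callable[[float, float], float],
                    buffer: List[Transition]) -> float:
    """Mean of Q(s, a_agent) - Q(s, a_human) over intervened steps.

    Assumed objective for illustration: driving this gap down makes the
    value estimate prefer the human's action wherever a takeover occurred.
    """
    gaps = [q_fn(t.state, t.agent_action) - q_fn(t.state, t.applied_action)
            for t in buffer if t.intervened]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy setting (hypothetical): state is the lateral offset from lane center.
def drifting_agent(state: float) -> float:
    return 0.5  # always steers the same way, so the car drifts


def cautious_human(state: float, proposed: float) -> Optional[float]:
    # Take over only if the proposed action would leave the lane (|offset| > 1).
    return -state if abs(state + proposed) > 1.0 else None


def lane_step(state: float, action: float) -> float:
    return state + action


def toy_q(state: float, action: float) -> float:
    return -abs(state + action)  # higher when the car ends up near the center


episode = collect_episode(drifting_agent, cautious_human, lane_step, 0.0, 6)
print("intervention rate:", intervention_rate(episode))
print("proxy value gap:", proxy_value_gap(toy_q, episode))
```

Recording both the proposed and the applied action at every step is the design choice that matters here: it keeps the agent's exploration data and the human's partial demonstrations in one buffer, so a learner can use the intervened steps as a preference signal and the intervention rate as a quantity to reduce.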
