Preference-based Reinforcement Learning (PbRL) methods use binary feedback from a human in the loop (HiL) over queried trajectory pairs to learn a reward model that approximates the human's underlying reward function and thereby captures their preferences. In this work, we investigate the high variability of initialized reward models, which are sensitive to the random seed of the experiment; this compounds the problem of degenerate reward functions from which PbRL methods already suffer. We propose a data-driven reward initialization method that adds no cost to the human in the loop and only negligible cost to the PbRL agent. We show that it makes the predicted rewards of the initialized reward model uniform across the state space, which reduces the variability of the method's performance across multiple runs and improves overall performance compared to other initialization methods.
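Below is a minimal sketch of one plausible instantiation of such a data-driven initialization, assuming the reward network is pretrained on unlabeled (state, action) pairs collected from cheap random rollouts by regressing its predictions toward a constant target; the class and function names and the constant-target loss are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Standard PbRL reward network over (state, action) pairs."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states, actions):
        return self.net(torch.cat([states, actions], dim=-1))


def initialize_reward_model(model, states, actions, target=0.0,
                            epochs=50, lr=3e-4, batch_size=256):
    """Pretrain the reward model so its predictions are (near-)uniform over
    states visited by unlabeled rollouts, before any human queries are made.
    No preference labels are used, so the HiL incurs no extra cost."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    n = states.shape[0]
    for _ in range(epochs):
        idx = torch.randperm(n)[:batch_size]
        pred = model(states[idx], actions[idx])
        # Regress predictions toward a flat constant so the initialized
        # reward is uniform across the sampled state space.
        loss = ((pred - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

Because the pretraining data come from rollouts the PbRL agent collects anyway, the only added expense is the brief regression above, which is negligible relative to policy and reward learning.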