Reward Shaping for Human Learning via Inverse Reinforcement Learning

From arXiv. This paper has been modified considerably for resubmission to the Journal of Machine Learning Research. For source code, see https://github.com/mrucker/kpirl-kla

Humans are spectacular reinforcement learners, constantly learning from and adjusting to experience and feedback. Unfortunately, this doesn't necessarily mean humans are fast learners. When tasks are challenging, learning can become unacceptably slow. Fortunately, humans do not have to learn tabula rasa, and learning speed can be greatly increased with learning aids. In this work we validate a new type of learning aid: reward shaping for humans via inverse reinforcement learning (IRL). The goal of this aid is to increase the speed with which humans can learn good policies for specific tasks. Furthermore, this approach complements alternative machine learning techniques, such as safety features that try to prevent individuals from making poor decisions. To achieve our results, we first extend a well-known IRL algorithm via kernel methods. Afterwards, we conduct two human-subjects experiments using an online game where players have limited time to learn a good policy. We show with statistical significance that players who receive our learning aid are able to approach desired policies more quickly than the control group.
