Reinforcement learning (RL) agents optimize only the features specified in a reward function and are indifferent to anything left out inadvertently. This means that we must specify not only what to do, but also the much larger space of what not to do. It is easy to forget these preferences, since they are already satisfied in our environment. This motivates our key insight: when a robot is deployed in an environment that humans act in, the state of the environment is already optimized for what humans want. We can therefore use this implicit preference information from the state to fill in the blanks. We develop an algorithm based on Maximum Causal Entropy IRL and use it to evaluate the idea in a suite of proof-of-concept environments designed to show its properties. We find that information from the initial state can be used to infer both side effects that should be avoided and preferences for how the environment should be organized. Our code can be found at https://github.com/HumanCompatibleAI/rlsp.
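To make the idea concrete, here is a minimal sketch of the underlying likelihood computation, assuming a toy tabular MDP; all names and dynamics below are illustrative, and this simplification omits most of the machinery in the full algorithm. We treat the observed state as the result of a human following a Maximum Causal Entropy (soft-optimal) policy for T steps, and ascend the gradient of that state's log-likelihood with respect to a linear reward.

```python
import numpy as np

# Hypothetical toy chain MDP (not one of the paper's environments): 4 states,
# actions "left"/"right", deterministic transitions, one-hot state features.
n_states, n_actions, T = 4, 2, 5
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0               # action 0: move left
    P[s, 1, min(s + 1, n_states - 1)] = 1.0    # action 1: move right
features = np.eye(n_states)

def soft_policy(theta, gamma=0.95, iters=100):
    """Soft (MaxCausalEnt) value iteration; returns a policy pi[s, a]."""
    r = features @ theta
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = r[:, None] + gamma * (P @ V)        # Q[s, a]
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))  # stable logsumexp
    return np.exp(Q - V[:, None])               # softmax over actions

def log_likelihood(theta, s_obs, p0):
    """log P(state at time T equals s_obs | theta), marginalizing trajectories."""
    pi = soft_policy(theta)
    p = p0.copy()
    for _ in range(T):
        # state-marginal update: p'(s') = sum_{s,a} p(s) * pi(s, a) * P(s, a, s')
        p = np.einsum("s,sa,sax->x", p, pi, P)
    return np.log(p[s_obs] + 1e-12)

# Observed state: the "human" ended up at the rightmost cell, which under this
# model is evidence that the right end is rewarding.
s_obs, p0 = n_states - 1, np.full(n_states, 1.0 / n_states)

theta = np.zeros(n_states)
for _ in range(150):  # finite-difference gradient ascent (simple, not fast)
    grad = np.zeros_like(theta)
    for i in range(n_states):
        e = np.zeros(n_states)
        e[i] = 1e-4
        grad[i] = (log_likelihood(theta + e, s_obs, p0)
                   - log_likelihood(theta - e, s_obs, p0)) / 2e-4
    theta += 0.05 * grad

print("inferred reward per state:", np.round(theta, 2))
```

Finite differences keep the sketch dependency-free; a real implementation would compute the gradient analytically or with automatic differentiation.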