Can an arbitrarily intelligent reinforcement learning agent be kept under control by a human user? Or do agents with sufficient intelligence inevitably find ways to shortcut their reward signal? This question impacts how far reinforcement learning can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we use an intuitive yet precise graphical model called causal influence diagrams to formalize reward tampering problems. We also describe a number of modifications to the reinforcement learning objective that prevent incentives for reward tampering. We verify the solutions using recently developed graphical criteria for inferring agent incentives from causal influence diagrams. Along the way, we also compare corrigibility and self-preservation properties of the various solutions, and discuss how they can be combined into a single agent without reward tampering incentives.