Agents are systems that optimize an objective function in an environment. Together, the goal and the environment induce secondary objectives, or incentives. Modeling the agent-environment interaction with causal influence diagrams, we can answer two fundamental questions about an agent's incentives directly from the graph: (1) which nodes can the agent have an incentive to observe, and (2) which nodes can the agent have an incentive to control? The answers tell us which information and influence points need extra protection. For example, we may want a classifier for job applications not to use the ethnicity of the candidate, and a reinforcement learning agent not to take direct control of its reward mechanism. Different algorithms and training paradigms can lead to different causal influence diagrams, so our method can be used to identify algorithms with problematic incentives and to help design algorithms with better incentives.
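To make the graph-based reading of incentives concrete, here is a minimal sketch in plain Python. It uses the simplified proxy that a node X can be a candidate for a control incentive only if X lies on a directed path from the decision node D to the utility node U; the node names and the example diagram are hypothetical, and this is not the paper's full graphical criterion (which also handles observations and d-separation).

```python
# Toy sketch: flag potential control-incentive nodes in a causal
# influence diagram, using the simplified proxy that such a node must
# lie on a directed path from the decision D to the utility U.
# The graph and node names below are illustrative assumptions.

def descendants(graph, start):
    """All nodes reachable from `start` via directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def ancestors(graph, target):
    """All nodes with a directed path into `target`."""
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return descendants(reverse, target)

def control_incentive_candidates(graph, decision, utility):
    # Nodes strictly between D and U on some directed path.
    between = descendants(graph, decision) & ancestors(graph, utility)
    return between - {decision, utility}

# Hypothetical diagram: the decision D affects the utility U both
# directly and via a mediator M (e.g. a reward mechanism); X is an
# observed parent of D that also influences U.
cid = {
    "X": ["D", "U"],
    "D": ["M", "U"],
    "M": ["U"],
    "U": [],
}
print(control_incentive_candidates(cid, "D", "U"))  # -> {'M'}
```

In this toy diagram the mediator M is flagged, matching the intuition from the abstract: a reward mechanism sitting between the agent's action and its utility is exactly the kind of influence point that may need extra protection.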