Inferring Lexicographically-Ordered Rewards from Preferences

Modeling the preferences of agents over a set of alternatives is a principal concern in many areas. The dominant approach has been to find a single reward/utility function with the property that alternatives yielding higher rewards are preferred over alternatives yielding lower rewards. However, in many settings, preferences are based on multiple, often competing, objectives; a single reward function is not adequate to represent such preferences. This paper proposes a method for inferring multi-objective reward-based representations of an agent's observed preferences. We model the agent's priorities over different objectives as entering lexicographically, so that objectives with lower priorities matter only when the agent is indifferent with respect to objectives with higher priorities. We offer two example applications in healthcare, one inspired by cancer treatment, the other inspired by organ transplantation, to illustrate how the lexicographically-ordered rewards we learn can provide a better understanding of a decision-maker's preferences and help improve policies when used in reinforcement learning.
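The lexicographic ordering described above can be illustrated with a minimal sketch. This is not the paper's inference method; it only shows how a preference between two alternatives would be read off once per-objective rewards are known. The function name `lex_prefers`, the tolerance parameter `tol` (a hypothetical threshold below which the agent is treated as indifferent on an objective), and the example reward values are all assumptions made for illustration.

```python
from typing import Sequence


def lex_prefers(rewards_a: Sequence[float],
                rewards_b: Sequence[float],
                tol: float = 0.0) -> int:
    """Compare two alternatives under lexicographically ordered rewards.

    Both sequences list rewards from highest to lowest priority.
    Returns 1 if A is preferred, -1 if B is preferred, 0 if indifferent.
    `tol` is an assumed indifference threshold, not taken from the paper.
    """
    if len(rewards_a) != len(rewards_b):
        raise ValueError("both alternatives need one reward per objective")
    for r_a, r_b in zip(rewards_a, rewards_b):
        diff = r_a - r_b
        # A lower-priority objective is consulted only when the agent is
        # indifferent (within tol) on every higher-priority objective.
        if abs(diff) > tol:
            return 1 if diff > 0 else -1
    return 0


# Hypothetical rewards: (primary objective, secondary objective).
print(lex_prefers([1.0, 0.0], [0.5, 10.0]))  # primary objective decides
print(lex_prefers([1.0, 0.0], [1.0, 2.0]))   # tie on primary, secondary decides
```

In the first call the primary objective alone settles the comparison, however large the secondary reward of the other alternative is. In the second call the primary rewards tie, so the secondary objective breaks the tie.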
