Bayesian inference over the reward presents an ideal solution to the ill-posed nature of the inverse reinforcement learning problem. Unfortunately, current methods generally do not scale well beyond the small tabular setting, owing to the need for an inner-loop MDP solver, and even non-Bayesian methods that do scale often require extensive interaction with the environment to perform well, making them inappropriate for high-stakes or costly applications such as healthcare. In this paper we introduce our method, Approximate Variational Reward Imitation Learning (AVRIL), which addresses both of these issues: through a variational treatment of the latent reward, it jointly learns an approximate posterior distribution over the reward, scaling to arbitrarily complicated state spaces, alongside an appropriate policy, in a completely offline manner. Applying our method to real medical data alongside classic control simulations, we demonstrate Bayesian reward inference in environments beyond the scope of current methods, as well as task performance competitive with focused offline imitation learning algorithms.
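To make the variational idea in the abstract concrete, the sketch below shows one plausible form such an offline objective could take: an encoder amortizes a Gaussian posterior over the reward at each state-action, a Q-network's softmax acts as the imitating (Boltzmann) policy, and a KL term against a standard-normal prior regularizes the reward posterior. Everything here is an illustrative assumption: the class names (`RewardEncoder`, `QNetwork`), the function `avril_style_loss`, and the hyperparameters `gamma` and `lam` are hypothetical and are not taken from the paper's exact architecture or objective.

```python
# Minimal sketch, assuming discrete actions, a Gaussian reward posterior,
# and a Boltzmann policy induced by a learned Q-network. This is not the
# paper's implementation; it only conveys the structure of jointly fitting
# a policy and an approximate reward posterior from fixed demonstrations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardEncoder(nn.Module):
    """Amortized posterior over reward: per-action Gaussian mean and log-variance."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n_actions),
        )

    def forward(self, s):
        mu, log_var = self.net(s).chunk(2, dim=-1)
        return mu, log_var


class QNetwork(nn.Module):
    """Q-values whose softmax serves as the imitating (Boltzmann) policy."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)


def avril_style_loss(q_net, encoder, batch, gamma=0.99, lam=1.0):
    """Offline loss over demonstration tuples (s, a, s_next, a_next)."""
    s, a, s_next, a_next = batch
    q = q_net(s)  # shape (batch, n_actions)

    # 1) Imitation term: demonstrated actions should be likely under the
    #    Boltzmann policy induced by the learned Q-values.
    imitation_nll = F.cross_entropy(q, a)

    # 2) Reward-consistency term: the TD residual implied by Q should be
    #    plausible under the approximate reward posterior at (s, a).
    mu, log_var = encoder(s)
    mu_a = mu.gather(1, a.unsqueeze(1)).squeeze(1)
    var_a = log_var.gather(1, a.unsqueeze(1)).squeeze(1).exp()
    q_sa = q.gather(1, a.unsqueeze(1)).squeeze(1)
    q_next = q_net(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    td_residual = q_sa - gamma * q_next
    reward_log_lik = -0.5 * ((td_residual - mu_a) ** 2 / var_a + var_a.log())

    # 3) KL divergence of the per-action Gaussian posterior N(mu, var)
    #    against a standard-normal prior, keeping the reward inference
    #    Bayesian and regularized.
    kl = 0.5 * (var_a + mu_a ** 2 - 1.0 - var_a.log())

    return imitation_nll - lam * reward_log_lik.mean() + kl.mean()
```

In a full treatment the reward-consistency expectation would typically be handled with the reparameterization trick rather than the simplified closed form above; the sketch is only meant to show how reward inference and policy learning can be optimized jointly from a fixed demonstration dataset, with no environment interaction.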