Bayesian Reinforcement Learning (BRL) provides a framework for generalisation in Reinforcement Learning (RL) through its use of Bayesian task parameters in the transition and reward models. However, classical BRL methods assume known forms for the transition and reward models, limiting their applicability to real-world problems. Recent deep BRL methods therefore incorporate model learning, but applying neural networks directly to the joint distribution of data and task parameters requires optimising the Evidence Lower Bound (ELBO). ELBOs are difficult to optimise and may yield indistinctive task parameters, and hence compromised BRL policies. To this end, we introduce a novel deep BRL method, Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL), which enables efficient and accurate learning of transition and reward models, with a fully tractable marginal likelihood and Bayesian inference over task parameters and model noise. On the challenging MetaWorld ML10/45 benchmarks, GLiBRL improves the success rate of VariBAD, one of the state-of-the-art deep BRL methods, by up to 2.7x. Against representative and recent deep BRL / Meta-RL methods such as MAML, RL2, SDVT, TrMRL and ECET, GLiBRL also consistently demonstrates low-variance, strong performance.
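To make the "fully tractable marginal likelihood" claim concrete, below is a minimal sketch, not GLiBRL's actual implementation, of the standard closed-form evidence for a Bayesian generalised linear model y = Phi w + eps with weight prior w ~ N(0, alpha^-1 I) and noise eps ~ N(0, beta^-1 I). The feature matrix Phi stands in for the output of a learnable basis network; the function name and the hyperparameters alpha and beta are illustrative assumptions, not symbols from the paper.

```python
import numpy as np

def log_marginal_likelihood(Phi, y, alpha, beta):
    """Exact log evidence for Bayesian linear regression with
    y = Phi @ w + noise, w ~ N(0, alpha^-1 I), noise ~ N(0, beta^-1 I).
    Phi (N x M) holds the basis-function features for N data points."""
    N, M = Phi.shape
    # Posterior precision over the weights: A = alpha I + beta Phi^T Phi
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    # Posterior mean: m = beta A^-1 Phi^T y
    m = beta * np.linalg.solve(A, Phi.T @ y)
    # Regularised error at the posterior mean
    E = 0.5 * beta * np.sum((y - Phi @ m) ** 2) + 0.5 * alpha * m @ m
    _, logdet_A = np.linalg.slogdet(A)
    # Closed-form log evidence (e.g. Bishop, PRML, eq. 3.86) -- no ELBO needed
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

# Illustrative usage with random stand-in features
rng = np.random.default_rng(0)
Phi = np.tanh(rng.normal(size=(64, 8)))   # stand-in for learned basis features
y = Phi @ rng.normal(size=8) + 0.1 * rng.normal(size=64)
print(log_marginal_likelihood(Phi, y, alpha=1.0, beta=100.0))
```

Because this evidence is exact and differentiable in Phi, a GLiBRL-style model can in principle backpropagate through it to train the basis network directly, rather than optimising a variational bound as ELBO-based deep BRL methods do.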