Batch Reinforcement Learning (Batch RL) consists of training a policy on trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides high-probability guarantees that the trained policy performs better than the behavioural policy, also called the baseline in this setting. Previous work shows that the SPI objective improves mean performance compared to the basic RL objective, which amounts to solving the MDP estimated by maximum likelihood. Here, we build on that work and improve the Safe Policy Improvement with Baseline Bootstrapping (SPIBB) algorithm by allowing the policy search over a wider set of policies. Instead of binarily classifying the state-action pairs into two sets (the \textit{uncertain} and the \textit{safe-to-train-on} ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty. The method can take more risks on uncertain actions while remaining provably safe, and is therefore less conservative than state-of-the-art methods. We propose two algorithms (one optimal and one approximate) to solve this constrained optimization problem and empirically show a significant improvement over existing SPI algorithms, both on finite MDPs and on infinite MDPs with neural network function approximation.
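The "softer strategy" above can be illustrated by a small sketch: rather than freezing the policy on all uncertain state-action pairs, the policy is allowed to move probability mass away from the baseline as long as the total uncertainty-weighted change stays within a budget, i.e.\ $\sum_a |\pi(a|s) - \pi_b(a|s)|\, e(s,a) \le \epsilon$ for a per-pair error bound $e(s,a)$. The greedy reallocation below is a hypothetical, simplified illustration of the approximate variant for a single state, not the authors' exact algorithm; the function name, the error bounds, and the budget value are assumptions for the example.

```python
import numpy as np

def soft_spibb_step(pi_b, q, err, eps):
    """Greedily improve on the baseline pi_b at one state:
    maximise sum_a pi(a) * q(a) subject to the soft constraint
    sum_a |pi(a) - pi_b(a)| * err(a) <= eps,
    with pi remaining a valid probability distribution.

    Illustrative sketch only; the real algorithm solves this
    constrained problem over all states of the MDP.
    """
    pi = pi_b.astype(float).copy()
    budget = eps
    best = np.argmax(q)            # action to receive probability mass
    for a in np.argsort(q):        # take mass from the lowest-Q actions first
        if a == best or budget <= 0:
            continue
        # Moving one unit of mass from a to best costs err[a] + err[best].
        cost = err[a] + err[best]
        move = min(pi[a], budget / cost) if cost > 0 else pi[a]
        pi[a] -= move
        pi[best] += move
        budget -= move * cost
    return pi

# With a tight budget, the policy only partially shifts towards the
# greedy action, staying close to the baseline where uncertainty is high.
pi = soft_spibb_step(np.array([0.5, 0.5]),   # baseline policy pi_b
                     np.array([0.0, 1.0]),   # Q-value estimates
                     np.array([1.0, 1.0]),   # error bounds e(s, a)
                     eps=0.4)
```

With uniform error bounds the budget caps how far the new policy can drift from the baseline; as more data is collected, `err` shrinks and the constraint relaxes towards plain greedy improvement.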