Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization

Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy that maximizes the cumulative reward in sequential decision making. Most methods in the existing literature are developed for \textit{online} settings, where data are easy to collect or simulate. Motivated by high-stakes domains such as mobile health studies, where data are limited and pre-collected, we study \textit{offline} reinforcement learning methods in this paper. To use these datasets efficiently for policy optimization, we propose a novel value enhancement method that improves the performance of a given initial policy computed by existing state-of-the-art RL algorithms. Specifically, when the initial policy is not consistent, our method outputs a policy whose value is no worse, and often better, than that of the initial policy. When the initial policy is consistent, under some mild conditions, our method yields a policy whose value converges to the optimal one at a faster rate than that of the initial policy, achieving the desired ``value enhancement'' property. The proposed method is generally applicable to any parametrized policy that belongs to a certain pre-specified function class (e.g., deep neural networks). Extensive numerical studies are conducted to demonstrate the superior performance of our method.
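To make the "no worse than the initial policy" idea concrete, the following is a minimal sketch of a generic trust-region-style improvement step. It is not the method proposed in the paper. It assumes a tiny tabular MDP whose transition probabilities and rewards are known (in the offline setting these would have to be estimated from pre-collected data), and it uses a KL penalty as the trust region, which the abstract does not specify. The MDP numbers and the names `policy_value`, `q_values`, and `kl_penalized_step` are hypothetical.

```python
import math

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
# P[s][a] lists next-state probabilities, R[s][a] is the immediate reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
R = [
    [0.0, 1.0],
    [0.5, 2.0],
]
GAMMA = 0.9  # discount factor (assumed)


def backup(s, a, v):
    """One-step lookahead: reward plus discounted expected next-state value."""
    return R[s][a] + GAMMA * sum(p * v[t] for t, p in enumerate(P[s][a]))


def policy_value(pi, tol=1e-10):
    """Iterative policy evaluation: returns the value of pi in each state."""
    v = [0.0] * len(P)
    while True:
        new_v = [
            sum(pi[s][a] * backup(s, a, v) for a in range(len(P[s])))
            for s in range(len(P))
        ]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def q_values(v):
    """Action values implied by state values v."""
    return [[backup(s, a, v) for a in range(len(P[s]))] for s in range(len(P))]


def kl_penalized_step(pi0, q, lam):
    """Per-state maximizer of E_pi[Q] - lam * KL(pi || pi0).

    The closed form is pi(a|s) proportional to pi0(a|s) * exp(Q(s,a) / lam).
    Larger lam keeps the new policy closer to pi0 (a tighter trust region).
    """
    new_pi = []
    for s in range(len(pi0)):
        m = max(q[s])  # subtract the max for numerical stability
        w = [pi0[s][a] * math.exp((q[s][a] - m) / lam) for a in range(len(pi0[s]))]
        z = sum(w)
        new_pi.append([x / z for x in w])
    return new_pi


pi0 = [[0.5, 0.5], [0.5, 0.5]]          # initial policy (uniform, assumed)
v0 = policy_value(pi0)                   # value of the initial policy
pi1 = kl_penalized_step(pi0, q_values(v0), lam=1.0)
v1 = policy_value(pi1)                   # value after one improvement step
```

In this sketch, comparing `v1` with `v0` state by state shows the updated policy is never worse than the initial one. That follows from the standard policy improvement argument when the model is known exactly; with estimated quantities, as in the offline setting the abstract describes, such a guarantee needs additional conditions.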
