Offline reinforcement learning (RL) aims to learn a policy that maximizes the expected return from a given static dataset of transitions. However, offline RL suffers from the distribution shift problem, and policy constraint methods have been proposed to address it. During policy constraint offline RL training, the divergence between the learned policy and the behavior policy must be kept within a given threshold, so the learned policy relies heavily on the quality of the behavior policy. This exposes a problem in existing policy constraint methods: if the dataset contains many low-reward transitions, the learned policy is constrained toward a suboptimal reference policy, leading to slow learning, low sample efficiency, and inferior performance. This paper shows that the common practice of sampling all transitions in the dataset for policy constraint offline RL can be improved. We propose a simple but effective sample filtering method that improves sample efficiency and final performance. First, we score transitions by the average reward and the average discounted reward of the episodes they belong to, and extract the transitions with high scores. Second, the high-score transitions are used to train the offline RL algorithms. We evaluate the proposed method with a series of offline RL algorithms on benchmark tasks, and experimental results show that it outperforms the baselines.
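The following is a minimal sketch of the episode-level filtering step described above, not the paper's actual implementation: the names (Transition, episode_score, filter_transitions), the simple averaging of the two scores, the keep_ratio parameter, and the quantile-based cutoff are all illustrative assumptions.

```python
# Sketch of episode-score-based sample filtering (illustrative assumptions only).
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Transition:
    state: np.ndarray
    action: np.ndarray
    reward: float
    next_state: np.ndarray
    done: bool

def episode_score(episode: List[Transition], gamma: float = 0.99) -> float:
    """Score an episode by its average reward and average discounted reward."""
    rewards = np.array([t.reward for t in episode])
    discounts = gamma ** np.arange(len(rewards))
    avg_reward = rewards.mean()
    avg_discounted = (discounts * rewards).mean()
    # How the two averages are combined is an assumption; a simple mean is used here.
    return 0.5 * (avg_reward + avg_discounted)

def filter_transitions(episodes: List[List[Transition]],
                       keep_ratio: float = 0.5,
                       gamma: float = 0.99) -> List[Transition]:
    """Keep transitions from the highest-scoring episodes and return them
    as a flat dataset for the downstream offline RL algorithm."""
    scores = [episode_score(ep, gamma) for ep in episodes]
    cutoff = np.quantile(scores, 1.0 - keep_ratio)
    kept = [ep for ep, s in zip(episodes, scores) if s >= cutoff]
    return [t for ep in kept for t in ep]
```

In this sketch the filtered, flattened transition list would simply replace the full dataset fed to the offline RL algorithm's replay buffer, leaving the training loop itself unchanged.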