Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity

Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications, but in practice the choice of reward function can be crucial for good results. While in principle the reward only needs to specify what the task is, in reality practitioners often need to design more detailed rewards that give the agent hints about how the task should be completed. This kind of "reward shaping" has often been discussed in the literature and is frequently a critical part of practical applications, but there is relatively little formal characterization of how the choice of reward shaping can yield benefits in sample complexity. In this work, we build on the framework of novelty-based exploration to provide a simple scheme for incorporating shaped rewards into RL, along with an analysis tool to show that particular choices of reward shaping provably improve sample efficiency. We characterize the class of problems where these gains are expected to be significant and show how this can be connected to practical algorithms in the literature. We confirm that these results hold in practice in an experimental evaluation, providing insight into the mechanisms through which reward shaping can significantly improve the sample complexity of reinforcement learning while retaining asymptotic performance.
