A Strong Baseline for Batch Imitation Learning

Imitation of expert behaviour is a highly desirable and safe approach to the problem of sequential decision making. We provide an easy-to-implement, novel algorithm for imitation learning under a strict data paradigm, in which the agent must learn solely from data collected a priori. This paradigm allows our algorithm to be used for environments in which safety or cost are of critical concern. Our algorithm requires no additional hyper-parameter tuning beyond any standard batch reinforcement learning (RL) algorithm, making it an ideal baseline for such data-strict regimes. Furthermore, we provide formal sample complexity guarantees for the algorithm in finite Markov Decision Problems. In doing so, we formally demonstrate an unproven claim from Kearns & Singh (1998). On the empirical side, our contribution is twofold. First, we develop a practical, robust and principled evaluation protocol for offline RL methods, making use of only the dataset provided for model selection. This stands in contrast to the vast majority of previous works in offline RL, which tune hyperparameters on the evaluation environment, limiting the practical applicability when deployed in new, cost-critical environments. As such, we establish precedent for the development and fair evaluation of offline RL algorithms. Second, we evaluate our own algorithm on challenging continuous control benchmarks, demonstrating its practical applicability and competitiveness with state-of-the-art performance, despite being a simpler algorithm.
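To make the idea of model selection from the dataset alone more concrete, here is a minimal sketch. It is not the evaluation protocol or the algorithm from the paper, neither of which is specified in the abstract. It assumes a generic held-out split on a tabular behavioural-cloning policy, and every name in it (`fit_policy`, `held_out_nll`, `select_alpha`, the smoothing parameter `alpha`, the toy dataset) is hypothetical and introduced only for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def fit_policy(pairs, n_actions, alpha):
    """Hypothetical tabular behavioural cloning: smoothed action frequencies.

    `alpha` (> 0) is an additive-smoothing hyperparameter, used here only as
    an example of something that needs to be selected.
    """
    counts = defaultdict(Counter)
    for state, action in pairs:
        counts[state][action] += 1

    def prob(state, action):
        total = sum(counts[state].values())
        return (counts[state][action] + alpha) / (total + alpha * n_actions)

    return prob


def held_out_nll(prob, pairs):
    """Average negative log-likelihood of held-out expert actions."""
    return -sum(math.log(prob(s, a)) for s, a in pairs) / len(pairs)


def select_alpha(dataset, n_actions, alphas, val_fraction=0.2, seed=0):
    """Pick `alpha` using only the fixed dataset, with no environment access."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    n_val = max(1, int(len(data) * val_fraction))
    val, train = data[:n_val], data[n_val:]
    scores = {
        alpha: held_out_nll(fit_policy(train, n_actions, alpha), val)
        for alpha in alphas
    }
    return min(scores, key=scores.get), scores


# Toy expert data (assumed): in state s the expert usually picks action s % 2.
rng = random.Random(1)
toy_data = []
for _ in range(200):
    s = rng.randrange(3)
    a = s % 2 if rng.random() < 0.9 else 1 - (s % 2)
    toy_data.append((s, a))

best_alpha, scores = select_alpha(toy_data, n_actions=2, alphas=[0.01, 1.0, 100.0])
print(best_alpha, scores)
```

The point of the sketch is only that the selection criterion is computed from a held-out part of the logged data, so no rollouts in the evaluation environment are needed to choose the hyperparameter.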
