Despite significant progress on challenging problems across various domains, applying state-of-the-art deep reinforcement learning (RL) algorithms remains difficult in practice due to their sensitivity to the choice of hyperparameters. This sensitivity can partly be attributed to the non-stationarity of the RL problem, which may require different hyperparameter settings at different stages of the learning process. Additionally, in the RL setting, hyperparameter optimization (HPO) requires a large number of environment interactions, hindering the transfer of RL's successes to real-world applications. In this work, we tackle the issues of sample-efficient and dynamic HPO in RL. We propose a population-based automated RL (AutoRL) framework to meta-optimize arbitrary off-policy RL algorithms. In this framework, we optimize the hyperparameters as well as the neural architecture while simultaneously training the agent. By sharing the collected experience across the population, we substantially increase the sample efficiency of the meta-optimization. We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm on the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training.
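To make the core idea concrete, the following is a minimal, self-contained sketch of population-based meta-optimization with a shared replay buffer: a population of agent configurations collects experience into one common buffer, each configuration is evaluated, and the weaker half is replaced by mutated copies of the stronger half. All names (`AgentConfig`, `collect_experience`, `evaluate`, `mutate`, `meta_optimize`) are hypothetical illustrations rather than the paper's implementation, and the environment interaction, TD3 updates, and fitness evaluation are replaced by toy stand-ins so the loop runs as written.

```python
# Hypothetical sketch of population-based AutoRL with a shared replay buffer.
# The environment, TD3 training, and evaluation are stubbed out for brevity.
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    """Hyperparameters evolved by the meta-optimization (illustrative subset)."""
    lr: float
    batch_size: int
    hidden_units: int  # stands in for the evolved neural architecture
    fitness: float = field(default=0.0)


def collect_experience(cfg, shared_buffer, steps=100):
    # Stand-in for environment interaction: every member of the population
    # appends its transitions to the *shared* replay buffer.
    for _ in range(steps):
        shared_buffer.append(("state", "action", random.random(), "next_state"))


def evaluate(cfg, shared_buffer):
    # Stand-in for off-policy training on the shared data plus a return estimate.
    # A real implementation would run TD3 updates sampled from shared_buffer.
    cfg.fitness = random.random() + 0.01 * cfg.hidden_units / 256
    return cfg.fitness


def mutate(cfg):
    # Perturb the hyperparameters (and architecture size) of a copied parent.
    return AgentConfig(
        lr=cfg.lr * random.choice([0.8, 1.2]),
        batch_size=max(32, int(cfg.batch_size * random.choice([0.5, 2.0]))),
        hidden_units=max(64, cfg.hidden_units + random.choice([-64, 0, 64])),
    )


def meta_optimize(population_size=4, generations=5):
    shared_buffer = deque(maxlen=100_000)  # one buffer for the whole population
    population = [
        AgentConfig(lr=10 ** random.uniform(-4, -3),
                    batch_size=random.choice([64, 128, 256]),
                    hidden_units=random.choice([128, 256]))
        for _ in range(population_size)
    ]
    for gen in range(generations):
        for cfg in population:
            collect_experience(cfg, shared_buffer)
            evaluate(cfg, shared_buffer)
        # Keep the better half, replace the rest with mutated copies of survivors.
        population.sort(key=lambda c: c.fitness, reverse=True)
        survivors = population[: population_size // 2]
        population = survivors + [mutate(random.choice(survivors)) for _ in survivors]
        print(f"gen {gen}: best fitness {survivors[0].fitness:.3f}, "
              f"buffer size {len(shared_buffer)}")
    return population[0]


if __name__ == "__main__":
    best = meta_optimize()
    print("best config:", best)
```

Because every configuration draws its training data from the same buffer, environment interactions collected by one population member are reused by all others, which is the mechanism behind the sample-efficiency gain described above.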