Model-based Adversarial Meta-Reinforcement Learning

Meta-reinforcement learning (meta-RL) aims to learn, from multiple training tasks, the ability to adapt efficiently to unseen test tasks. Despite their success, existing meta-RL algorithms are known to be sensitive to task distribution shift: when the test task distribution differs from the training task distribution, performance may degrade significantly. To address this issue, this paper proposes Model-based Adversarial Meta-Reinforcement Learning (AdMRL), a model-based approach that aims to minimize the worst-case suboptimality gap (the difference between the optimal return and the return the algorithm achieves after adaptation) across all tasks in a family of tasks. We propose a minimax objective and optimize it by alternating between learning the dynamics model on a fixed task and finding the adversarial task for the current model, that is, the task for which the policy induced by the model is maximally suboptimal. Assuming the family of tasks is parameterized, we derive a formula for the gradient of the suboptimality with respect to the task parameters via the implicit function theorem, and show how the gradient estimator can be implemented efficiently with the conjugate gradient method and a novel use of the REINFORCE estimator. We evaluate our approach on several continuous control benchmarks and demonstrate its advantages over existing state-of-the-art meta-RL algorithms in worst-case performance over all tasks, generalization to out-of-distribution tasks, and training- and test-time sample efficiency.
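The adversarial-task step described in the abstract can be illustrated with a toy sketch. It is not the paper's implementation. It assumes a quadratic surrogate in which the adapted parameters theta*(psi) maximize J(theta, psi) = -0.5 theta^T A theta + theta^T B psi, and the suboptimality gap is 0.5 ||theta*(psi) - C psi||^2. The matrices `A`, `B`, `C` and all function names are hypothetical. Exact matrices stand in for sampled estimates, so the REINFORCE part is omitted. The implicit function theorem gives d theta*/d psi = A^{-1} B, and conjugate gradient solves the linear system using only Hessian-vector products.

```python
import numpy as np

def conjugate_gradient(hvp, b, iters=50, tol=1e-20):
    """Solve H x = b for symmetric positive-definite H, given only v -> H v."""
    x = np.zeros_like(b)
    r = b.copy()              # residual b - H x, with x starting at zero
    p = r.copy()              # current search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:          # squared residual small enough: stop
            break
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Hypothetical toy problem: 4 policy parameters, 2 task parameters.
rng = np.random.default_rng(0)
n, m = 4, 2
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)   # symmetric positive definite
B = rng.normal(size=(n, m))
C = rng.normal(size=(n, m))   # the truly optimal parameters are C @ psi

def adapted_params(psi):
    """theta*(psi): maximizer of the toy model objective J(theta, psi)."""
    return np.linalg.solve(A, B @ psi)

def gap(psi):
    """Toy suboptimality gap of the adapted parameters on task psi."""
    return 0.5 * np.sum((adapted_params(psi) - C @ psi) ** 2)

def gap_gradient(psi):
    """Gradient of the gap w.r.t. psi via the implicit function theorem."""
    v = adapted_params(psi) - C @ psi            # partial gap / partial theta
    x = conjugate_gradient(lambda u: A @ u, v)   # x = A^{-1} v, no explicit inverse
    # (d theta*/d psi)^T v = B^T A^{-1} v; the direct dependence adds -C^T v.
    return B.T @ x - C.T @ v

def find_adversarial_task(psi0, steps=100, lr=0.05, bound=1.0):
    """Projected gradient ascent on the gap over the box [-bound, bound]^m."""
    psi = np.clip(np.asarray(psi0, dtype=float), -bound, bound)
    for _ in range(steps):
        psi = np.clip(psi + lr * gap_gradient(psi), -bound, bound)
    return psi
```

In the full alternating scheme, a model-update step would follow each adversarial-task search. Here only the inner maximization over the task parameters is shown, and the box constraint stands in for the parameterized task family.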
