Reinforcement learning agents need a reward signal to learn successful policies. When this signal is sparse or the corresponding gradient is deceptive, such agents need a dedicated mechanism to explore their search space efficiently without relying on the reward. Searching for a large diversity of behaviors and using Motion Planning (MP) algorithms are two options in this context. In this paper, we build on the common roots between these two options to investigate the properties of two diversity search algorithms: Novelty Search and Goal Exploration Processes. These algorithms look for diversity in an outcome space, or behavioral space, which is generally hand-designed to represent what matters for a given task. The relation to MP algorithms reveals that the smoothness, or lack of smoothness, of the mapping between the policy parameter space and the outcome space plays a key role in search efficiency. In particular, we show empirically that if the mapping is smooth enough, i.e., if two policies that are close in the parameter space lead to similar outcomes, then diversity search algorithms tend to inherit the exploration properties of MP algorithms. By contrast, if it is not, they lose these properties and their performance strongly depends on specific heuristics, notably filtering mechanisms that discard some of the explored policies.
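To make "searching for diversity in an outcome space" concrete, below is a minimal Python sketch of the standard Novelty Search novelty measure: the novelty of a policy's outcome is its mean distance to the k nearest neighbours in an archive of previously observed outcomes. This is an illustrative sketch, not the paper's implementation; the function name, parameter k, and the 2-D behaviour descriptors are assumptions made for the example.

```python
# Minimal sketch (illustrative, not the paper's code): novelty of an outcome,
# computed as the mean Euclidean distance to its k nearest neighbours in an
# archive of previously observed outcomes, as in Novelty Search.
import numpy as np

def novelty(outcome, archive, k=15):
    """Mean distance from `outcome` to its k nearest archived outcomes.

    Higher values indicate behaviours farther from anything seen so far,
    so maximizing novelty pushes the search toward diverse outcomes.
    """
    if len(archive) == 0:
        return np.inf  # the first outcome is maximally novel by convention
    dists = np.linalg.norm(np.asarray(archive) - outcome, axis=1)
    k = min(k, len(dists))  # fewer archived outcomes than k early in the run
    return float(np.sort(dists)[:k].mean())

# Toy usage: outcomes as hand-designed 2-D behaviour descriptors
# (e.g. an agent's final position at the end of an episode).
archive = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
print(novelty(np.array([0.5, 2.0]), archive))
```

Note that the distance is measured in the outcome space, not in the policy parameter space; the paper's smoothness question is precisely whether small steps in parameter space translate into small distances under this measure.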