On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control

Reinforcement learning is a framework for interactive decision-making in which incentives are revealed sequentially over time and no model of the system dynamics is available. Because it scales to continuous spaces, we focus on policy search, in which one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), global optimality may be obtained under persistent exploration and suitable parameterization. By contrast, in continuous space, non-convexity poses a pathological challenge, as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by heavier-tailed distributions, governed by a tail-index parameter alpha, which increase the likelihood of jumping in state space. Doing so invalidates the smoothness conditions on the score function that are common in PG analyses. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy processes of heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, and we corroborate that it also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned.
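To make the exploration claim concrete, the following is a minimal sketch, not the authors' code, comparing how often a light-tailed and a heavy-tailed policy propose an action far from its mean. It assumes a one-dimensional action, uses a Gaussian policy as the light-tailed baseline, and uses a Cauchy policy (a stable distribution with tail index alpha = 1) as one illustrative heavy-tailed choice. The function names, the threshold of three scale units, and the sample size are all hypothetical.

```python
import math
import random


def sample_gaussian_action(mean, scale, rng):
    """Draw one action from a Gaussian policy (light tails)."""
    return rng.gauss(mean, scale)


def sample_cauchy_action(mean, scale, rng):
    """Draw one action from a Cauchy policy (heavy tails, tail index 1).

    Uses inverse-CDF sampling: mean + scale * tan(pi * (u - 1/2)).
    """
    u = rng.random()
    return mean + scale * math.tan(math.pi * (u - 0.5))


def far_jump_fraction(sampler, mean, scale, n, threshold, seed):
    """Fraction of n sampled actions farther than threshold * scale from the mean."""
    rng = random.Random(seed)  # fixed seed so the comparison is reproducible
    far = 0
    for _ in range(n):
        if abs(sampler(mean, scale, rng) - mean) > threshold * scale:
            far += 1
    return far / n


if __name__ == "__main__":
    # Illustrative settings only; none of these values come from the paper.
    for name, sampler in [("gaussian", sample_gaussian_action),
                          ("cauchy", sample_cauchy_action)]:
        frac = far_jump_fraction(sampler, 0.0, 1.0, 20000, 3.0, seed=0)
        print(f"{name}: fraction of actions beyond 3 scale units = {frac:.4f}")
```

Under these assumptions the Cauchy policy places roughly a fifth of its probability mass beyond three scale units, whereas the Gaussian places well under one percent there, which is the sense in which a heavier tail raises the likelihood of large jumps.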

