Policy gradient methods, which have been extensively studied over the last decade, offer an effective and efficient framework for reinforcement learning problems. However, their performance can often be unsatisfactory, suffering from unreliable reward improvements and slow convergence due to high variance in gradient estimates. In this paper, we propose a universal reward profiling framework that can be seamlessly integrated with any policy gradient algorithm, in which we selectively update the policy based on high-confidence performance estimates. We show theoretically that our technique does not slow the convergence of the baseline policy gradient method and, with high probability, yields stable and monotonic improvements in its performance. Empirically, on eight continuous-control benchmarks (Box2D and MuJoCo/PyBullet), our profiling yields up to 1.5x faster convergence to near-optimal returns and up to a 1.75x reduction in return variance on some setups. Our profiling approach offers a general, theoretically grounded path to more reliable and efficient policy learning in complex environments.
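To make the gating idea concrete, the sketch below illustrates one plausible reading of "selectively update the policy based on high-confidence performance estimates": evaluate the candidate policy over several rollouts, form a confidence interval on its return, and accept the update only when the lower confidence bound clears the current best estimate. All interface names (`evaluate_policy`, `profiled_update`, `agent.get_parameters`, `agent.policy_gradient_step`) and the 95% normal-approximation bound are illustrative assumptions, not the paper's actual method or API.

```python
# Minimal sketch of a confidence-gated policy update (assumed structure,
# not the paper's implementation). Uses the Gymnasium-style env API.
import numpy as np

def evaluate_policy(policy, env, n_episodes=10):
    """Estimate mean return and a simple confidence half-width from rollouts."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    returns = np.asarray(returns)
    mean = returns.mean()
    # 95% normal-approximation half-width; the paper may use a different bound.
    half_width = 1.96 * returns.std(ddof=1) / np.sqrt(n_episodes)
    return mean, half_width

def profiled_update(agent, env, best_return):
    """Take one baseline policy-gradient step; keep it only if the lower
    confidence bound on the new return does not fall below the current best."""
    snapshot = agent.get_parameters()      # hypothetical accessor
    agent.policy_gradient_step(env)        # any baseline PG update
    mean, half_width = evaluate_policy(agent.policy, env)
    if mean - half_width >= best_return:   # high-confidence improvement
        return max(best_return, mean)      # accept the update
    agent.set_parameters(snapshot)         # otherwise roll back
    return best_return
```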