Uncertainty estimation in Reinforcement Learning (RL) is a critical component in control tasks where agents must balance safe exploration and efficient learning. While deep neural networks have enabled breakthroughs in RL, they often lack calibrated uncertainty estimates. We introduce Deep Gaussian Process Proximal Policy Optimization (GPPO), a scalable, model-free actor-critic algorithm that leverages Deep Gaussian Processes (DGPs) to approximate both the policy and the value function. GPPO achieves performance competitive with Proximal Policy Optimization on standard high-dimensional continuous control benchmarks while providing well-calibrated uncertainty estimates that can inform safer and more effective exploration.
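To make the actor-critic structure concrete, the following is a minimal PyTorch sketch of a PPO-style clipped update in which each head outputs a predictive mean and variance, standing in for the DGP layers described above. The class and parameter names (`GPPOSketch`, `clip_eps`, the variance-weighted critic loss) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GPPOSketch(nn.Module):
    """Stand-in for the DGP actor-critic: each head outputs a mean and a
    log-variance, mimicking the predictive uncertainty a DGP would provide."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.actor_mean = nn.Linear(hidden, act_dim)
        self.actor_logvar = nn.Linear(hidden, act_dim)
        self.critic_mean = nn.Linear(hidden, 1)
        self.critic_logvar = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        # Gaussian policy whose scale comes from the learned log-variance head
        pi = torch.distributions.Normal(
            self.actor_mean(h), torch.exp(0.5 * self.actor_logvar(h)))
        v_mean = self.critic_mean(h).squeeze(-1)
        v_var = torch.exp(self.critic_logvar(h)).squeeze(-1)
        return pi, v_mean, v_var

def gppo_loss(model, obs, actions, logp_old, advantages, returns, clip_eps=0.2):
    pi, v_mean, v_var = model(obs)
    logp = pi.log_prob(actions).sum(-1)            # new log-probabilities
    ratio = torch.exp(logp - logp_old)             # importance-sampling ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # PPO clipped surrogate objective
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Gaussian negative log-likelihood so the critic's variance is trained too
    value_loss = 0.5 * (((returns - v_mean) ** 2) / v_var + torch.log(v_var)).mean()
    return policy_loss + 0.5 * value_loss

# One illustrative update on a random batch (shapes only; not benchmark data)
model = GPPOSketch(obs_dim=8, act_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
obs, actions = torch.randn(32, 8), torch.randn(32, 2)
logp_old, adv, ret = torch.randn(32), torch.randn(32), torch.randn(32)
loss = gppo_loss(model, obs, actions, logp_old, adv, ret)
opt.zero_grad(); loss.backward(); opt.step()
```

The critic's predictive variance is what a calibrated DGP value function would expose for uncertainty-aware exploration; here it is only a heteroscedastic neural head used to keep the sketch self-contained.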