采用基于渐进的政策优化政策学习不当 (Improper Learning with Gradient-based Policy Optimization) - 专知论文

会员服务 ·

0

优化器 · 学成 · 基 · 价值函数 · 控制器 ·

2021 年 2 月 16 日

Improper Learning with Gradient-based Policy Optimization

翻译：采用基于渐进的政策优化政策学习不当

Mohammadi Zaki,Avinash Mohan,Aditya Gopalan,Shie Mannor

We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. The value function of the mixture and its gradient may not be available in closed-form; however, we show that we can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient descent optimization. We derive convergence and convergence rate guarantees for the approach assuming access to a gradient oracle. Numerical results on a challenging constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when each constituent policy at its disposal is unstable.

翻译：我们考虑一种不适当的强化学习环境,让学习者获得M基控制器,用于一个未知的马尔科夫决定程序,我们希望将它们最佳地结合起来,以产生一个可能的新控制器,其性能优于每个基数。我们建议一种基于梯度的方法,在控制器的某一类不适当的混合物上运行。混合物及其梯度的价值功能可能无法以闭式形式提供;然而,我们表明,我们可以使用推出和同时的扰动近似近似(SPSA)来进行明确的梯度下降优化。我们为假设使用梯度或电弧的方法获得趋同率和趋同率保证。具有挑战性的排队限制任务的数字结果显示,我们不当的政策优化算法可以稳定系统,即使其使用的每个组成政策都不稳定。

0

相关内容

优化器

【KDD2020-Tutorial】因果推理与稳定学习，Causal Inference and Stable Learning

【KDD2020-Tutorial】因果推理与稳定学习，Causal Inference and Stable Learning

专知会员服务

87+阅读 · 2020年8月28日

【伯克利-Ke Li】学习优化，74页ppt，Learning to Optimize

【伯克利-Ke Li】学习优化，74页ppt，Learning to Optimize

专知会员服务

41+阅读 · 2020年7月23日

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

专知会员服务

84+阅读 · 2020年2月18日

【斯坦福大学】Gradient Surgery for Multi-Task Learning

【斯坦福大学】Gradient Surgery for Multi-Task Learning

专知会员服务

47+阅读 · 2020年1月23日

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

专知会员服务

4+阅读 · 2020年1月7日

【机器学习基础最新版】（Mathematics for Machine Learning），417页pdf

【机器学习基础最新版】（Mathematics for Machine Learning），417页pdf

专知会员服务

244+阅读 · 2019年10月21日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

已删除

将门创投

7+阅读 · 2019年10月10日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Inverse Reinforcement Learning a Control Lyapunov Approach

Arxiv

1+阅读 · 2021年4月9日

Learning Sampling Policy for Faster Derivative Free Optimization

Learning Sampling Policy for Faster Derivative Free Optimization

Arxiv

0+阅读 · 2021年4月9日

Model-based Adversarial Meta-Reinforcement Learning

Arxiv

5+阅读 · 2020年6月16日

Meta-Learning with Implicit Gradients

Meta-Learning with Implicit Gradients

Arxiv

13+阅读 · 2019年9月10日

PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation

PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation

Arxiv

8+阅读 · 2018年12月18日

Residual Policy Learning

Residual Policy Learning

Arxiv

4+阅读 · 2018年12月15日

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Arxiv

8+阅读 · 2018年11月21日

Reinforcement Learning for Solving the Vehicle Routing Problem

Arxiv

3+阅读 · 2018年5月21日

Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations

Arxiv

9+阅读 · 2018年4月22日

Improved Image Captioning via Policy Gradient optimization of SPIDEr

Arxiv

6+阅读 · 2018年3月12日

VIP会员

文章信息

相关主题

相关VIP内容

【KDD2020-Tutorial】因果推理与稳定学习，Causal Inference and Stable Learning

【KDD2020-Tutorial】因果推理与稳定学习，Causal Inference and Stable Learning

专知会员服务

87+阅读 · 2020年8月28日

【伯克利-Ke Li】学习优化，74页ppt，Learning to Optimize

【伯克利-Ke Li】学习优化，74页ppt，Learning to Optimize

专知会员服务

41+阅读 · 2020年7月23日

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

专知会员服务

84+阅读 · 2020年2月18日

【斯坦福大学】Gradient Surgery for Multi-Task Learning

【斯坦福大学】Gradient Surgery for Multi-Task Learning

专知会员服务

47+阅读 · 2020年1月23日

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

【Google 76分钟训练万BERT最新论文】Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

专知会员服务

4+阅读 · 2020年1月7日

【机器学习基础最新版】（Mathematics for Machine Learning），417页pdf

【机器学习基础最新版】（Mathematics for Machine Learning），417页pdf

专知会员服务

244+阅读 · 2019年10月21日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

新型数字杀伤链：理解综合战术网络对野战炮兵体系的能力与效益

《对抗环境中运用数字孪生技术优化预测性维护与后勤保障》2025最新93页

《任务式指挥十六个案例研究》232页

《幻觉还是事实：国防大型语言模型的可信度评估研究》2025最新109页

相关资讯

已删除

将门创投

7+阅读 · 2019年10月10日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

相关论文

Inverse Reinforcement Learning a Control Lyapunov Approach

Arxiv

1+阅读 · 2021年4月9日

Learning Sampling Policy for Faster Derivative Free Optimization

Learning Sampling Policy for Faster Derivative Free Optimization

Arxiv

0+阅读 · 2021年4月9日

Model-based Adversarial Meta-Reinforcement Learning

Arxiv

5+阅读 · 2020年6月16日

Meta-Learning with Implicit Gradients

Meta-Learning with Implicit Gradients

Arxiv

13+阅读 · 2019年9月10日

PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation

PPO-CMA: Proximal Policy Optimization with Covariance Matrix Adaptation

Arxiv

8+阅读 · 2018年12月18日

Residual Policy Learning

Residual Policy Learning

Arxiv

4+阅读 · 2018年12月15日

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Arxiv

8+阅读 · 2018年11月21日

Reinforcement Learning for Solving the Vehicle Routing Problem

Arxiv

3+阅读 · 2018年5月21日

Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations

Arxiv

9+阅读 · 2018年4月22日

Improved Image Captioning via Policy Gradient optimization of SPIDEr

Arxiv

6+阅读 · 2018年3月12日

微信扫码咨询专知VIP会员