向全球最佳回报率上升的后退幅度 (Reward-Weighted Regression Converges to a Global Optimum) - 专知论文

会员服务 ·

0

对数似然 · 泛函 · 学成 · CASE · 优化器 ·

2021 年 10 月 15 日

Reward-Weighted Regression Converges to a Global Optimum

翻译：向全球最佳回报率上升的后退幅度

Miroslav Štrupl,Francesco Faccio,Dylan R. Ashley,Rupesh Kumar Srivastava,Jürgen Schmidhuber

from arxiv, 7 pages in main text + 2 pages of references + 6 pages of appendices, 1 figure in main text + 1 figure in appendices; source code available at https://github.com/dylanashley/reward-weighted-regression

Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic improvement of the policy under certain circumstances, whether and under which conditions RWR converges to the optimal policy have remained open questions. In this paper, we provide for the first time a proof that RWR converges to a global optimum when no function approximation is used, in a general compact setting. Furthermore, for the simpler case with finite state and action spaces we prove R-linear convergence of the state-value function to the optimum.

翻译：重负回归(RWR) 属于一个以期望-最大化框架为基础的广为人知的迭代强化学习算法大家庭。在这个大家庭中,每次迭代学习都包括利用现行政策抽样一组轨迹,并采用新政策最大限度地实现回报加权日志行动。虽然据知RWR在某些情况下会产生政策单一的改进,但RWR是否和在何种条件下与最佳政策趋同仍然是开放的问题。在本文中,我们首次提供了证据,证明RWR在没有使用函数近似值的情况下,在一般契约环境下,RWR会与全球最佳一致。此外,对于使用有限状态和行动空间的简单案例,我们证明国家价值功能与最佳状态的R线性趋同。

0

相关内容

对数似然

【经典书】计算最优传输，209页pdf，Computational Optimal Transport

【经典书】计算最优传输，209页pdf，Computational Optimal Transport

专知会员服务

75+阅读 · 2021年1月10日

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日

【普林斯顿大学-微软】加权元学习，Weighted Meta-Learning

【普林斯顿大学-微软】加权元学习，Weighted Meta-Learning

专知会员服务

40+阅读 · 2020年3月25日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

最大均方差正则化贝叶斯神经网络，Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

最大均方差正则化贝叶斯神经网络，Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

专知会员服务

54+阅读 · 2020年3月5日

【牛津大学ICLR2020】通过元学习的贝叶斯自适应深度RL, VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

【牛津大学ICLR2020】通过元学习的贝叶斯自适应深度RL, VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

专知会员服务

25+阅读 · 2020年2月28日

深度强化学习策略梯度教程，53页ppt

深度强化学习策略梯度教程，53页ppt

专知会员服务

184+阅读 · 2020年2月1日

Risk Sensitive Portfolio Optimization with Regime-Switching and Default Contagion，香港理工大学应用数学系余翔助理教授，第八届全国社会媒体处理大会SMP2019

Risk Sensitive Portfolio Optimization with Regime-Switching and Default Contagion，香港理工大学应用数学系余翔助理教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

10+阅读 · 2019年10月24日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

181+阅读 · 2019年10月11日

已删除

将门创投

8+阅读 · 2019年6月13日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

【推荐】决策树/随机森林深入解析

【推荐】决策树/随机森林深入解析

机器学习研究会

5+阅读 · 2017年9月21日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

强化学习 cartpole_a3c

强化学习 cartpole_a3c

CreateAMind

9+阅读 · 2017年7月21日

Minimax Regret for Stochastic Shortest Path

Arxiv

0+阅读 · 2021年12月9日

An improper estimator with optimal excess risk in misspecified density estimation and logistic regression

Arxiv

0+阅读 · 2021年12月8日

Entropy Regularization for Mean Field Games with Learning

Arxiv

0+阅读 · 2021年12月8日

Convergence Results For Q-Learning With Experience Replay

Arxiv

0+阅读 · 2021年12月8日

Aggregation of Pareto optimal models

Arxiv

0+阅读 · 2021年12月8日

Online estimation and control with optimal pathlength regret

Arxiv

0+阅读 · 2021年12月7日

First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

Arxiv

0+阅读 · 2021年12月7日

Gradient play in stochastic games: stationary points, convergence, and sample complexity

Arxiv

0+阅读 · 2021年12月6日

Variable Selection in Regression-based Estimation of Dynamic Treatment Regimes

Arxiv

0+阅读 · 2021年12月3日

Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Arxiv

8+阅读 · 2021年4月22日

VIP会员

文章信息

相关主题

相关VIP内容

【经典书】计算最优传输，209页pdf，Computational Optimal Transport

【经典书】计算最优传输，209页pdf，Computational Optimal Transport

专知会员服务

75+阅读 · 2021年1月10日

INRIA 最新《机器学习理论》课程笔记，176页pdf

专知会员服务

51+阅读 · 2020年12月14日

【普林斯顿大学-微软】加权元学习，Weighted Meta-Learning

【普林斯顿大学-微软】加权元学习，Weighted Meta-Learning

专知会员服务

40+阅读 · 2020年3月25日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

最大均方差正则化贝叶斯神经网络，Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

最大均方差正则化贝叶斯神经网络，Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

专知会员服务

54+阅读 · 2020年3月5日

【牛津大学ICLR2020】通过元学习的贝叶斯自适应深度RL, VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

【牛津大学ICLR2020】通过元学习的贝叶斯自适应深度RL, VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

专知会员服务

25+阅读 · 2020年2月28日

深度强化学习策略梯度教程，53页ppt

深度强化学习策略梯度教程，53页ppt

专知会员服务

184+阅读 · 2020年2月1日

Risk Sensitive Portfolio Optimization with Regime-Switching and Default Contagion，香港理工大学应用数学系余翔助理教授，第八届全国社会媒体处理大会SMP2019

Risk Sensitive Portfolio Optimization with Regime-Switching and Default Contagion，香港理工大学应用数学系余翔助理教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

10+阅读 · 2019年10月24日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

181+阅读 · 2019年10月11日

热门VIP内容

开通专知VIP会员享更多权益服务

人工智能治理的未来

模态感知的特征匹配：单一模态与跨模态技术的全面综述

无监督行人重识别研究综述

【牛津博士论文】面向神经影像应用的可扩展且可解释的空间模型

相关资讯

已删除

将门创投

8+阅读 · 2019年6月13日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

逆强化学习-学习人先验的动机

逆强化学习-学习人先验的动机

CreateAMind

16+阅读 · 2019年1月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Hierarchical Imitation - Reinforcement Learning

Hierarchical Imitation - Reinforcement Learning

CreateAMind

19+阅读 · 2018年5月25日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

【推荐】决策树/随机森林深入解析

【推荐】决策树/随机森林深入解析

机器学习研究会

5+阅读 · 2017年9月21日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

强化学习 cartpole_a3c

强化学习 cartpole_a3c

CreateAMind

9+阅读 · 2017年7月21日

相关论文

Minimax Regret for Stochastic Shortest Path

Arxiv

0+阅读 · 2021年12月9日

An improper estimator with optimal excess risk in misspecified density estimation and logistic regression

Arxiv

0+阅读 · 2021年12月8日

Entropy Regularization for Mean Field Games with Learning

Arxiv

0+阅读 · 2021年12月8日

Convergence Results For Q-Learning With Experience Replay

Arxiv

0+阅读 · 2021年12月8日

Aggregation of Pareto optimal models

Arxiv

0+阅读 · 2021年12月8日

Online estimation and control with optimal pathlength regret

Arxiv

0+阅读 · 2021年12月7日

First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach

Arxiv

0+阅读 · 2021年12月7日

Gradient play in stochastic games: stationary points, convergence, and sample complexity

Arxiv

0+阅读 · 2021年12月6日

Variable Selection in Regression-based Estimation of Dynamic Treatment Regimes

Arxiv

0+阅读 · 2021年12月3日

Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Arxiv

8+阅读 · 2021年4月22日

微信扫码咨询专知VIP会员