示范无无标准非标准RL:近最佳记录和应用多机构RL和库存控制 (Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control) - 专知论文

会员服务 ·

0

上置信界限 · 控制器 · Markov · 泛函 · 知识 (knowledge) ·

2022 年 8 月 20 日

Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control

翻译：示范无无标准非标准RL:近最佳记录和应用多机构RL和库存控制

Weichao Mao,Kaiqing Zhang,Ruihao Zhu,David Simchi-Levi,Tamer Başar

from arxiv, A preliminary version of this work has appeared in ICML 2021

We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes. Both the reward functions and the state transition functions are allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain variation budgets. We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL, and show that it outperforms existing solutions in terms of dynamic regret. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret bound of $\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions, respectively, $\Delta>0$ is the variation budget, $H$ is the number of time steps per episode, and $T$ is the total number of time steps. We further present a parameter-free algorithm named Double-Restart Q-UCB that does not require prior knowledge of the variation budget. We show that our algorithms are \emph{nearly optimal} by establishing an information-theoretical lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL. Numerical experiments validate the advantages of RestartQ-UCB in terms of both cumulative rewards and computational efficiency. We demonstrate the power of our results in examples of multi-agent RL and inventory control across related products.

翻译：我们考虑在非静止的 Markov 决策程序中不使用模型的强化学习 {RL 。只要累积变化不超过某些变化预算, 奖励功能和国家过渡功能可以随时间任意变化。我们提议重新启动高信任库(重新启动Q-UCB), 这是非静止RL的第一个不使用模型的算法, 并显示它以动态的遗憾来比现有的解决方案高。具体地说, 以自由型的奖金条件重新启动 Q- OCB, 实现全方位telde{O}(S\frac{1}3 ⁇ 3} A\frac{N1}3}\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

0

相关内容

上置信界限

上置信界限

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

81+阅读 · 2020年7月26日

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

专知会员服务

84+阅读 · 2020年2月18日

UC.Berkeley CS189讲义教材:《机器学习全面指南》，185页pdf

专知会员服务

162+阅读 · 2020年1月16日

【强化学习论文推荐集合】2019年必读的10篇TOP强化学习论文，My Top 10 Deep RL Papers of 2019

【强化学习论文推荐集合】2019年必读的10篇TOP强化学习论文，My Top 10 Deep RL Papers of 2019

专知会员服务

42+阅读 · 2020年1月15日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

181+阅读 · 2019年10月11日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

VCIP 2022 Call for Special Session Proposals

VCIP 2022 Call for Special Session Proposals

CCF多媒体专委会

1+阅读 · 2022年4月1日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium6

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium6

中国图象图形学学会CSIG

2+阅读 · 2021年11月12日

【ICIG2021】Latest News & Announcements of the Industry Talk1

【ICIG2021】Latest News & Announcements of the Industry Talk1

中国图象图形学学会CSIG

0+阅读 · 2021年7月28日

局部学习的特征选择：Local-Learning-Based Feature Selection

局部学习的特征选择：Local-Learning-Based Feature Selection

我爱读PAMI

14+阅读 · 2019年9月20日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

一类离散Hindmarsh-Rose模型的分支延拓

国家自然科学基金

0+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

44+阅读 · 2015年12月31日

基于Morrey空间的函数空间实变理论及其应用

国家自然科学基金

0+阅读 · 2014年12月31日

不确定环境下强化学习和决策的神经机制

国家自然科学基金

11+阅读 · 2012年12月31日

Arisandilactone A 的不对称全合成

国家自然科学基金

0+阅读 · 2012年12月31日

基于Junction tree推理的多运动平台分散式协同导航算法研究

国家自然科学基金

2+阅读 · 2012年12月31日

宿主蛋白Rab家族在IFITMs抑制病毒复制中的作用研究

国家自然科学基金

0+阅读 · 2012年12月31日

有机－金属－配体自组装体系中有序介孔金属有机骨架化合物的合成及性质研究

国家自然科学基金

0+阅读 · 2011年12月31日

城市主干路溢流的形成机理、协调控制及仿真实验研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于过滤器-模式搜索方法的节能发电调度多Agent系统的研究

国家自然科学基金

0+阅读 · 2009年12月31日

Provable General Function Class Representation Learning in Multitask Bandits and MDPs

Arxiv

0+阅读 · 2022年10月6日

Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage

Arxiv

0+阅读 · 2022年10月6日

Reward-Mixing MDPs with a Few Latent Contexts are Learnable

Arxiv

0+阅读 · 2022年10月5日

Policy Gradient for Reinforcement Learning with General Utilities

Arxiv

0+阅读 · 2022年10月3日

Renewable Composite Quantile Method and Algorithm for Nonparametric Models with Streaming Data

Arxiv

0+阅读 · 2022年10月3日

Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation

Arxiv

0+阅读 · 2022年10月3日

Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection

Arxiv

0+阅读 · 2022年10月3日

Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

Arxiv

0+阅读 · 2022年10月2日

A Deep Conjugate Direction Method for Iteratively Solving Linear Systems

Arxiv

0+阅读 · 2022年10月1日

Inverse Online Learning: Understanding Non-Stationary and Reactionary Policies

Arxiv

0+阅读 · 2022年9月30日

VIP会员

文章信息

相关主题

上置信界限

知识 (knowledge)

相关VIP内容

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

81+阅读 · 2020年7月26日

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

【牛津大学】深度残差强化学习，Deep Residual Reinforcement Learning

专知会员服务

84+阅读 · 2020年2月18日

UC.Berkeley CS189讲义教材:《机器学习全面指南》，185页pdf

专知会员服务

162+阅读 · 2020年1月16日

【强化学习论文推荐集合】2019年必读的10篇TOP强化学习论文，My Top 10 Deep RL Papers of 2019

【强化学习论文推荐集合】2019年必读的10篇TOP强化学习论文，My Top 10 Deep RL Papers of 2019

专知会员服务

42+阅读 · 2020年1月15日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

160+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

181+阅读 · 2019年10月11日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

人工智能治理的未来

模态感知的特征匹配：单一模态与跨模态技术的全面综述

无监督行人重识别研究综述

【牛津博士论文】面向神经影像应用的可扩展且可解释的空间模型

相关资讯

VCIP 2022 Call for Special Session Proposals

VCIP 2022 Call for Special Session Proposals

CCF多媒体专委会

1+阅读 · 2022年4月1日

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium6

【ICIG2021】Check out the hot new trailer of ICIG2021 Symposium6

中国图象图形学学会CSIG

2+阅读 · 2021年11月12日

【ICIG2021】Latest News & Announcements of the Industry Talk1

【ICIG2021】Latest News & Announcements of the Industry Talk1

中国图象图形学学会CSIG

0+阅读 · 2021年7月28日

局部学习的特征选择：Local-Learning-Based Feature Selection

局部学习的特征选择：Local-Learning-Based Feature Selection

我爱读PAMI

14+阅读 · 2019年9月20日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

强化学习族谱

强化学习族谱

CreateAMind

26+阅读 · 2017年8月2日

相关论文

Provable General Function Class Representation Learning in Multitask Bandits and MDPs

Arxiv

0+阅读 · 2022年10月6日

Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage

Arxiv

0+阅读 · 2022年10月6日

Reward-Mixing MDPs with a Few Latent Contexts are Learnable

Arxiv

0+阅读 · 2022年10月5日

Policy Gradient for Reinforcement Learning with General Utilities

Arxiv

0+阅读 · 2022年10月3日

Renewable Composite Quantile Method and Algorithm for Nonparametric Models with Streaming Data

Arxiv

0+阅读 · 2022年10月3日

Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation

Arxiv

0+阅读 · 2022年10月3日

Dimension Reduction in Contextual Online Learning via Nonparametric Variable Selection

Arxiv

0+阅读 · 2022年10月3日

Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

Arxiv

0+阅读 · 2022年10月2日

A Deep Conjugate Direction Method for Iteratively Solving Linear Systems

Arxiv

0+阅读 · 2022年10月1日

Inverse Online Learning: Understanding Non-Stationary and Reactionary Policies

Arxiv

0+阅读 · 2022年9月30日

相关基金

一类离散Hindmarsh-Rose模型的分支延拓

国家自然科学基金

0+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

44+阅读 · 2015年12月31日

基于Morrey空间的函数空间实变理论及其应用

国家自然科学基金

0+阅读 · 2014年12月31日

不确定环境下强化学习和决策的神经机制

国家自然科学基金

11+阅读 · 2012年12月31日

Arisandilactone A 的不对称全合成

国家自然科学基金

0+阅读 · 2012年12月31日

基于Junction tree推理的多运动平台分散式协同导航算法研究

国家自然科学基金

2+阅读 · 2012年12月31日

宿主蛋白Rab家族在IFITMs抑制病毒复制中的作用研究

国家自然科学基金

0+阅读 · 2012年12月31日

有机－金属－配体自组装体系中有序介孔金属有机骨架化合物的合成及性质研究

国家自然科学基金

0+阅读 · 2011年12月31日

城市主干路溢流的形成机理、协调控制及仿真实验研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于过滤器-模式搜索方法的节能发电调度多Agent系统的研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员