亚线强化学习减少差异方法 (Variance Reduction Methods for Sublinear Reinforcement Learning)

This work considers the problem of provably optimal reinforcement learning for episodic finite horizon MDPs, i.e. how an agent learns to maximize his/her long term reward in an uncertain environment. The main contribution is in providing a novel algorithm --- Variance-reduced Upper Confidence Q-learning (vUCQ) --- which enjoys a regret bound of $\widetilde{O}(\sqrt{HSAT} + H^5SA)$, where the $T$ is the number of time steps the agent acts in the MDP, $S$ is the number of states, $A$ is the number of actions, and $H$ is the (episodic) horizon time. This is the first regret bound that is both sub-linear in the model size and asymptotically optimal. The algorithm is sub-linear in that the time to achieve $\epsilon$-average regret for any constant $\epsilon$ is $O(SA)$, which is a number of samples that is far less than that required to learn any non-trivial estimate of the transition model (the transition model is specified by $O(S^2A)$ parameters). The importance of sub-linear algorithms is largely the motivation for algorithms such as $Q$-learning and other "model free" approaches. vUCQ algorithm also enjoys minimax optimal regret in the long run, matching the $\Omega(\sqrt{HSAT})$ lower bound. Variance-reduced Upper Confidence Q-learning (vUCQ) is a successive refinement method in which the algorithm reduces the variance in $Q$-value estimates and couples this estimation scheme with an upper confidence based algorithm. Technically, the coupling of both of these techniques is what leads to the algorithm enjoying both the sub-linear regret property and the asymptotically optimal regret.

翻译：这项工作考虑到一个问题,即对超常的有限地平线 MDP 进行最优化的强化学习, 也就是说, 一个代理商如何学会在不确定的环境中最大限度地增加其长期报酬。主要贡献在于提供新的算法 -- -- 差异- 差异- 差异- 降低高信任 Q学习( vUCQ) -- -- 享有美元全局[ O} (\ sqrt{HSAT} + H% 5SA) 的遗憾。美元是代理商在 MDP 中行动的时间步骤的数量, 美元是州的数量, 美元是行动的数量, 美元是行动的数量, 美元是( 快速) 前景。这是第一个遗憾, 在模型中, 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币- 货币-