Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.
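To make the algorithmic statement concrete, the display below sketches one natural form of the optimistic LSVI backup at step $h$ of episode $k$; the notation here ($\phi$ for the feature map, $\Lambda_h$ for the regularized Gram matrix, $\lambda$ for the ridge parameter, and $\beta$ for the exploration-bonus coefficient) is assumed for illustration rather than taken from the abstract itself:
\begin{align*}
\Lambda_h &= \sum_{\tau=1}^{k-1} \phi(x_h^\tau, a_h^\tau)\,\phi(x_h^\tau, a_h^\tau)^\top + \lambda I, \\
w_h &= \Lambda_h^{-1} \sum_{\tau=1}^{k-1} \phi(x_h^\tau, a_h^\tau)\Big[\, r_h(x_h^\tau, a_h^\tau) + \max_{a} Q_{h+1}(x_{h+1}^\tau, a) \Big], \\
Q_h(x,a) &= \min\Big\{ w_h^\top \phi(x,a) + \beta \sqrt{\phi(x,a)^\top \Lambda_h^{-1}\, \phi(x,a)},\; H \Big\}.
\end{align*}
In this sketch, each $Q_h$ is a regularized least-squares fit of the backed-up values plus an elliptical, UCB-style bonus, and every quantity is computed in time polynomial in $d$ and $H$, with no dependence on the number of states or actions.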