Q-learning, which seeks to learn the optimal Q-function of a Markov decision process (MDP) in a model-free fashion, lies at the heart of reinforcement learning. When it comes to the synchronous setting (in which independent samples for all state-action pairs are drawn from a generative model in each iteration), substantial progress has been made towards understanding the sample efficiency of Q-learning. Consider a $\gamma$-discounted infinite-horizon MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$: to yield an entrywise $\varepsilon$-approximation of the optimal Q-function, state-of-the-art theory for Q-learning requires a sample size exceeding the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^5\varepsilon^{2}}$, which fails to match existing minimax lower bounds. This gives rise to natural questions: what is the sharp sample complexity of Q-learning? Is Q-learning provably sub-optimal? This paper addresses these questions for the synchronous setting: (1) when $|\mathcal{A}|=1$ (so that Q-learning reduces to TD learning), we prove that the sample complexity of TD learning is minimax optimal and scales as $\frac{|\mathcal{S}|}{(1-\gamma)^3\varepsilon^2}$ (up to a logarithmic factor); (2) when $|\mathcal{A}|\geq 2$, we settle the sample complexity of Q-learning to be on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ (up to a logarithmic factor). Our theory unveils the strict sub-optimality of Q-learning when $|\mathcal{A}|\geq 2$, and rigorizes the negative impact of over-estimation in Q-learning. Finally, we extend our analysis to accommodate asynchronous Q-learning (i.e., the case with Markovian samples), sharpening the horizon dependency of its sample complexity to be $\frac{1}{(1-\gamma)^4}$.
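To make the synchronous setting concrete: in each iteration $t$, the learner draws one independent sample $s_t(s,a)\sim P(\cdot\mid s,a)$ for every pair $(s,a)$ from the generative model and performs the update $Q_{t+1}(s,a) = (1-\eta_t)\,Q_t(s,a) + \eta_t\big(r(s,a) + \gamma \max_{a'\in\mathcal{A}} Q_t(s_t(s,a), a')\big)$; when $|\mathcal{A}|=1$ the maximum is vacuous and the recursion coincides with TD learning. The sketch below illustrates this update under assumed interfaces (a hypothetical sampler `sample_next_state`, a reward table, and an illustrative stepsize schedule); it is not the exact algorithm parameters or learning rates analyzed in the paper.

```python
# Minimal sketch of synchronous Q-learning with a generative model.
# The sampler, reward table, and stepsize schedule are illustrative assumptions.
import numpy as np

def synchronous_q_learning(sample_next_state, reward, num_states, num_actions,
                           gamma=0.9, num_iters=1000):
    """In every iteration, draw one fresh next-state sample for *each* (s, a)
    pair from the generative model and update that entry towards its
    empirical Bellman target."""
    Q = np.zeros((num_states, num_actions))
    for t in range(1, num_iters + 1):
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)  # illustrative stepsize, not the paper's schedule
        Q_new = Q.copy()
        for s in range(num_states):
            for a in range(num_actions):
                s_next = sample_next_state(s, a)            # one sample from P(. | s, a)
                target = reward[s, a] + gamma * Q[s_next].max()
                Q_new[s, a] = (1 - eta) * Q[s, a] + eta * target
        Q = Q_new
    return Q
```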