Deep Q-learning based algorithms have been applied successfully to many decision-making problems, but their theoretical foundations are not as well understood. In this paper, we study Fitted Q-Iteration with a two-layer ReLU neural network parameterization and establish sample complexity guarantees for the algorithm. Our approach estimates the Q-function in each iteration by solving a convex optimization problem. We show that the algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^{2})$, which is order-optimal. The result holds for countable state spaces and requires no structural assumptions on the MDP, such as linearity or low rank.
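For context, the sketch below illustrates the generic Fitted Q-Iteration loop that the abstract refers to: each iteration regresses Bellman targets onto a function class. The regression oracle `fit_q` is a stand-in assumption for the paper's convex-program estimator over the two-layer ReLU parameterization; all names here are illustrative, not the authors' implementation.

```python
def fitted_q_iteration(transitions, n_actions, gamma, n_iters, fit_q):
    """Generic Fitted Q-Iteration skeleton.

    transitions: list of (state, action, reward, next_state) samples.
    fit_q: regression oracle mapping (inputs, targets) to a new
           Q estimate, i.e. a callable q(state, action) -> float.
    Returns the Q estimate after n_iters iterations.
    """
    q = lambda s, a: 0.0  # start from the zero Q-function
    for _ in range(n_iters):
        inputs, targets = [], []
        for s, a, r, s_next in transitions:
            # Bellman target: r + gamma * max_a' Q_k(s', a')
            y = r + gamma * max(q(s_next, b) for b in range(n_actions))
            inputs.append((s, a))
            targets.append(y)
        # Fit the next Q-function to the targets; in the paper this
        # regression step is solved as a convex optimization problem
        # over a two-layer ReLU network (not reproduced here).
        q = fit_q(inputs, targets)
    return q
```

Any supervised regressor can be plugged in as `fit_q`; the paper's contribution is that solving this step via a convex program over the ReLU parameterization yields the stated $\tilde{\mathcal{O}}(1/\epsilon^{2})$ guarantee.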