In decision-making problems with continuous state and action spaces, linear dynamical models are widely employed. In particular, policies for stochastic linear systems subject to quadratic cost functions capture a large number of applications in reinforcement learning. Several randomized policies that address the trade-off between identification and control have been studied in the recent literature. However, little is known about policies based on bootstrapping observed states and actions. In this work, we show that bootstrap-based policies achieve square-root scaling of regret with respect to time. We also obtain results on the accuracy of learning the model's dynamics. Corroborative numerical analysis illustrating the technical results is also provided.
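The abstract does not specify the bootstrap procedure; as a minimal illustrative sketch (the 2×2 system, noise level, and all variable names below are hypothetical, not taken from the paper), one can resample observed state–action transitions with replacement and refit a least-squares estimate of the dynamics matrices [A, B] to quantify estimation uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true linear dynamics: x_{t+1} = A x_t + B u_t + w_t
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])

# Collect a trajectory under random exploratory actions
T = 500
x = np.zeros(2)
states, actions, next_states = [], [], []
for _ in range(T):
    u = rng.normal(size=1)
    x_next = A @ x + B @ u + 0.1 * rng.normal(size=2)
    states.append(x), actions.append(u), next_states.append(x_next)
    x = x_next

Z = np.hstack([np.array(states), np.array(actions)])  # regressors [x_t, u_t]
Y = np.array(next_states)                             # targets x_{t+1}

def ls_estimate(Z, Y):
    # Least-squares estimate of [A | B] from one (possibly resampled) dataset
    theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return theta.T  # shape (2, 3): estimated [A | B]

# Bootstrap: resample transition indices with replacement, refit each time
n_boot = 200
estimates = np.stack(
    [ls_estimate(Z[idx], Y[idx])
     for idx in (rng.integers(0, T, size=T) for _ in range(n_boot))]
)

theta_hat = ls_estimate(Z, Y)        # point estimate of [A | B]
spread = estimates.std(axis=0)       # bootstrap spread of each entry
print("point estimate of [A | B]:\n", theta_hat)
print("bootstrap std of each entry:\n", spread)
```

The bootstrap spread shrinks as more transitions are observed, consistent with the model-accuracy results the abstract refers to; an adaptive policy could, for instance, act on resampled estimates rather than the point estimate alone.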