In recent years, researchers have made significant progress in devising reinforcement-learning algorithms for optimizing linear temporal logic (LTL) objectives and LTL-like objectives. Despite these advancements, there are fundamental limitations to how well this problem can be solved. Previous studies have alluded to this fact but have not examined it in depth. In this paper, we address the tractability of reinforcement learning for general LTL objectives from a theoretical perspective. We formalize the problem under the probably approximately correct learning in Markov decision processes (PAC-MDP) framework, a standard framework for measuring sample complexity in reinforcement learning. In this formalization, we prove that the optimal policy for an LTL formula is PAC-MDP-learnable if and only if the formula belongs to the most limited class in the LTL hierarchy, consisting of formulas that are decidable within a finite horizon. Practically, our result implies that, for LTL objectives that are not decidable within a finite horizon, it is impossible for a reinforcement-learning algorithm to obtain a PAC-MDP guarantee on the performance of its learned policy after finitely many interactions with an unconstrained environment.
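As an illustrative aside (these example formulas are our own and are not drawn from the paper), the notion of being "decidable within a finite horizon" can be made concrete with standard LTL operators:

\[
\varphi_1 \;=\; p \,\lor\, \mathsf{X}\, p \,\lor\, \mathsf{X}\mathsf{X}\, p \qquad \text{(decidable within horizon } 3\text{)}
\]
\[
\varphi_2 \;=\; \mathsf{F}\, p \qquad \text{(not decidable within any finite horizon)}
\]

The first three steps of any trace determine whether \(\varphi_1\) is satisfied, so \(\varphi_1\) lies in the finite-horizon-decidable class. By contrast, no finite prefix in which \(p\) has not yet occurred can settle \(\varphi_2\), since \(p\) may still occur later; under the result stated above, optimal policies for such objectives are therefore not PAC-MDP-learnable in an unconstrained environment.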