In this paper, we establish a subgame perfect equilibrium reinforcement learning (SPERL) framework for time-inconsistent (TIC) problems. In the context of RL, TIC problems are known to face two main challenges: the non-existence of natural recursive relationships between value functions at different time points, and the violation of Bellman's principle of optimality, which calls into question the applicability of standard policy iteration algorithms since their policy improvement theorems can no longer be established. We adapt an extended dynamic programming theory and propose a new class of algorithms, called backward policy iteration (BPI), that solves SPERL and addresses both challenges. To demonstrate the practical use of BPI as a training framework, we adapt standard RL simulation methods and derive two BPI-based training algorithms. We examine the derived training frameworks on a mean-variance portfolio selection problem and evaluate their performance in terms of convergence and model identifiability.
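As a rough illustration of the backward-in-time structure that the name backward policy iteration suggests, the minimal Python sketch below improves the policy at each time step while holding the already-computed later-time policies fixed. All names here (backward_policy_iteration, evaluate_q, improve, and the tabular toy rewards and transitions) are illustrative assumptions rather than the paper's notation, and the plain rollout inside evaluate_q merely stands in for the extended dynamic programming recursion developed in the paper.

```python
import numpy as np

def backward_policy_iteration(T, evaluate_q, improve):
    """Generic backward sweep: compute the time-t policy while holding the
    already-computed policies at times t+1, ..., T-1 fixed.  In the SPERL
    setting, evaluate_q would implement the paper's extended (non-Bellman)
    value-function recursion rather than the plain rollout used below."""
    policy = {}
    for t in reversed(range(T)):
        q_t = evaluate_q(t, policy)   # depends only on the fixed future policies
        policy[t] = improve(q_t)      # improve the time-t policy alone
    return policy

# --- Toy tabular instantiation (illustrative assumptions only) ---------------
T, n_states, n_actions = 5, 4, 3
rng = np.random.default_rng(0)
rewards = rng.normal(size=(T, n_states, n_actions))
next_state = rng.integers(n_states, size=(T, n_states, n_actions))

def evaluate_q(t, future_policy):
    """Roll out the fixed future policies from each (state, action) pair at time t."""
    q = rewards[t].copy()
    for x in range(n_states):
        for a in range(n_actions):
            s, ret = next_state[t, x, a], 0.0
            for u in range(t + 1, T):
                a_u = future_policy[u][s]
                ret += rewards[u, s, a_u]
                s = next_state[u, s, a_u]
            q[x, a] += ret
    return q

def improve(q_t):
    # Greedy improvement over actions at the current time step.
    return q_t.argmax(axis=1)

policy = backward_policy_iteration(T, evaluate_q, improve)
print({t: policy[t].tolist() for t in range(T)})
```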