Parameterized MDPs and Reinforcement Learning Problems: A Maximum Entropy Principle Based Framework

We present a framework for a class of sequential decision-making problems. The framework learns an optimal control policy that is robust to noisy data, determines unknown state and action parameters, and supports sensitivity analysis with respect to problem parameters. We consider two broad categories of sequential decision-making problems, modelled as infinite-horizon Markov Decision Processes (MDPs) with and without an absorbing state. The central idea is to quantify exploration by the Shannon entropy of the trajectories under the MDP and to determine the stochastic policy that maximizes this entropy while guaranteeing a low expected cost along a trajectory. The resulting policy improves the quality of exploration early in the learning process and consequently yields faster convergence and robust solutions even with noisy data, as demonstrated in our comparisons with popular algorithms such as Q-learning, Double Q-learning, and entropy-regularized Soft Q-learning. The framework extends to parameterized MDP and RL problems, where states and actions are parameter dependent and the objective is to determine the optimal parameters along with the corresponding optimal policy. Here, the associated cost function can be non-convex with multiple poor local minima. Simulations on a 5G small cell network problem demonstrate successful determination of communication routes and small cell locations. We also obtain sensitivity measures with respect to problem parameters and demonstrate robustness to noisy environment data.
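To illustrate the trade-off between entropy and expected cost, the following is a minimal sketch of a generic entropy-regularized (soft) value iteration on a toy chain MDP with an absorbing goal state. It is not the algorithm of the paper: it regularizes with the entropy of the action distribution only, whereas the trajectory entropy described above would also involve the transition probabilities. The chain MDP, the costs, the discount factor `gamma`, the inverse temperature `beta`, and all function names are assumptions made for illustration.

```python
import numpy as np


def chain_mdp(n_states=4):
    """Hypothetical toy MDP: action 0 advances (cost 1), action 1 stays (cost 2).

    The last state is absorbing and incurs zero cost.
    """
    P = np.zeros((n_states, 2, n_states))  # P[s, a, s'] transition probabilities
    C = np.zeros((n_states, 2, n_states))  # C[s, a, s'] transition costs
    goal = n_states - 1
    for s in range(goal):
        P[s, 0, s + 1] = 1.0
        C[s, 0, s + 1] = 1.0
        P[s, 1, s] = 1.0
        C[s, 1, s] = 2.0
    P[goal, :, goal] = 1.0
    return P, C


def soft_min(Q, beta):
    """Numerically stable -(1/beta) * log(sum_a exp(-beta * Q[s, a]))."""
    z = -beta * Q
    m = z.max(axis=1, keepdims=True)
    return -(m[:, 0] + np.log(np.exp(z - m).sum(axis=1))) / beta


def expected_q(P, C, V, gamma):
    """Q[s, a] = sum_s' P[s, a, s'] * (C[s, a, s'] + gamma * V[s'])."""
    return np.einsum("sat,sat->sa", P, C + gamma * V[None, None, :])


def soft_value_iteration(P, C, beta, gamma=0.9, tol=1e-9, max_iter=10_000):
    """Iterate the soft Bellman recursion and return (V, policy)."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        V_new = soft_min(expected_q(P, C, V, gamma), beta)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = expected_q(P, C, V, gamma)
    V = soft_min(Q, beta)
    # Gibbs (softmax) policy; each row sums to one by construction.
    policy = np.exp(-beta * (Q - V[:, None]))
    return V, policy


P, C = chain_mdp()
for beta in (1e-3, 1.0, 50.0):
    _, policy = soft_value_iteration(P, C, beta)
    print(f"beta={beta:g}  P(advance | state 0) = {policy[0, 0]:.4f}")
```

In this sketch, a small `beta` weights entropy heavily and yields a nearly uniform, exploratory policy, while a large `beta` concentrates the policy on the low-cost action. Gradually increasing `beta` is one common way to move from exploration toward a low-cost policy.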
