Model-Based Reinforcement Learning for Stochastic Hybrid Systems

Optimal control of general nonlinear systems is a central challenge in automation. Data-driven approaches to control, enabled by powerful function approximators, have recently had great success in tackling challenging robotic applications. However, such methods often obscure the structure of the dynamics and control behind black-box, over-parameterized representations, which limits our ability to understand the closed-loop behavior. This paper adopts a hybrid-system view of nonlinear modeling and control that lends an explicit hierarchical structure to the problem and breaks down complex dynamics into simpler localized units. To this end, we consider a sequence modeling paradigm that captures the temporal structure of the data and derive an expectation-maximization (EM) algorithm that automatically decomposes nonlinear dynamics into stochastic piecewise affine dynamical systems with nonlinear boundaries. Furthermore, we show that these time-series models naturally admit a closed-loop extension, which we use to extract locally linear or polynomial feedback controllers from nonlinear experts via imitation learning. Finally, we introduce a novel hybrid relative entropy policy search (Hb-REPS) technique that incorporates the hierarchical nature of hybrid systems and optimizes a set of time-invariant local feedback controllers derived from a locally polynomial approximation of a global value function.
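To make the model class concrete, the following is a minimal sketch of a stochastic piecewise affine system with a nonlinear switching boundary. It is not the paper's implementation: the function name `simulate_pwa`, the two modes, the matrices, the noise level, and the quadratic (circular) boundary are all assumptions chosen purely for illustration.

```python
import numpy as np

def simulate_pwa(x0, steps, rng):
    """Roll out a hypothetical two-mode stochastic piecewise affine system."""
    # Illustrative per-mode affine dynamics x' = A[z] x + b[z] + noise.
    A = np.array([[[0.9, 0.1], [0.0, 0.9]],
                  [[0.8, -0.2], [0.2, 0.8]]])
    b = np.array([[0.0, 0.1], [0.1, 0.0]])
    noise_std = 0.01  # assumed isotropic Gaussian process noise

    xs = np.empty((steps + 1, 2))
    zs = np.empty(steps, dtype=int)
    xs[0] = x0
    for t in range(steps):
        x = xs[t]
        # Nonlinear boundary: a logistic function of a quadratic feature,
        # so mode 1 becomes likely outside the unit circle.
        logit = 5.0 * (x @ x - 1.0)
        p_mode1 = 1.0 / (1.0 + np.exp(-logit))
        zs[t] = int(rng.random() < p_mode1)
        # Apply the affine dynamics of the sampled mode plus noise.
        xs[t + 1] = A[zs[t]] @ x + b[zs[t]] + noise_std * rng.standard_normal(2)
    return xs, zs

rng = np.random.default_rng(0)
states, modes = simulate_pwa(np.array([2.0, 0.0]), steps=50, rng=rng)
```

Here the discrete mode is sampled from a state-dependent probability, which is one simple way to obtain stochastic switching with a nonlinear boundary; an EM procedure such as the one described above would estimate quantities like `A`, `b`, and the boundary parameters from data rather than fixing them by hand.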
