We propose a novel approach that enables planning agents to compose abstract skills by observing and learning from their historical interactions with the world. Our framework operates in a Markov state-space model with a set of actions whose pre-conditions are unknown. We formulate skills as high-level abstract policies that propose action plans based on the current state. Each policy learns new plans by observing state transitions as the agent interacts with the world. This approach automatically discovers new plans that achieve specific intended effects, but the success of a plan often depends on the states in which it is applied. We therefore formulate plan evaluation as a set of multi-armed bandit problems with infinitely many arms, balancing the allocation of resources between estimating the success probabilities of existing arms and exploring new options. The result is a planner that automatically learns robust high-level skills in noisy environments; these skills implicitly capture action pre-conditions without requiring explicit knowledge of them. We show experimentally that this planning approach is highly competitive in high-dimensional state-space domains.
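To make the bandit formulation concrete, the following is a minimal sketch, not the paper's algorithm: each candidate plan is treated as a Bernoulli arm, where pulling the arm executes the plan and observes success or failure. The `PlanBandit` class, the UCB1 scoring rule, and the sqrt(t) schedule for admitting new arms are all illustrative assumptions introduced here, standing in for whatever evaluation and exploration strategy the full method uses.

```python
# A minimal sketch (not the paper's algorithm) of evaluating an unbounded
# pool of candidate plans as a multi-armed bandit with infinitely many arms.
# Assumptions: UCB1 scoring for known plans, and a sqrt(t) schedule for
# admitting brand-new plans into the pool.
import math
import random


class PlanBandit:
    """Tracks empirical success probabilities of plans (arms) and decides
    when to re-trial an existing plan versus admit a new candidate."""

    def __init__(self, new_plan_sampler):
        self.new_plan_sampler = new_plan_sampler  # yields fresh candidate plans
        self.plans = []       # arm identities
        self.successes = []   # successful executions per plan
        self.trials = []      # attempted executions per plan
        self.t = 0            # total pulls so far

    def select(self):
        self.t += 1
        # Arm-admission schedule: keep roughly sqrt(t) arms in play, so
        # exploration of new plans never stops but slows over time.
        if len(self.plans) < math.isqrt(self.t) + 1:
            self.plans.append(self.new_plan_sampler())
            self.successes.append(0)
            self.trials.append(0)
            return len(self.plans) - 1

        # Otherwise pick the existing plan with the highest UCB1 score.
        def ucb(i):
            if self.trials[i] == 0:
                return float("inf")
            mean = self.successes[i] / self.trials[i]
            return mean + math.sqrt(2 * math.log(self.t) / self.trials[i])

        return max(range(len(self.plans)), key=ucb)

    def update(self, i, succeeded):
        self.trials[i] += 1
        self.successes[i] += int(succeeded)


# Toy usage: each "plan" is reduced to a hidden success probability, and
# executing it is a Bernoulli draw. In the paper's setting the draw would
# be an actual plan execution whose outcome depends on the current state.
if __name__ == "__main__":
    bandit = PlanBandit(new_plan_sampler=lambda: random.random())
    for _ in range(5000):
        i = bandit.select()
        bandit.update(i, random.random() < bandit.plans[i])
    best = max(range(len(bandit.plans)),
               key=lambda i: bandit.successes[i] / max(bandit.trials[i], 1))
    print(f"best plan success ~ {bandit.plans[best]:.2f} after {bandit.t} pulls")
```

The sqrt(t) admission schedule is one standard way to handle infinitely many arms: because new plans can always be generated, resources must be split between refining success estimates for plans already in the pool and sampling plans that have never been tried.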