The ability to discover useful behaviours from past experience and transfer them to new tasks is considered a core component of natural embodied intelligence. Inspired by neuroscience, the discovery of behaviours that switch at bottleneck states has long been sought after for inducing plans of minimum description length across tasks. Prior approaches have either supported only online, on-policy bottleneck-state discovery, limiting sample efficiency, or have been restricted to discrete state-action domains, limiting applicability. To address this, we introduce Model-Based Offline Options (MO2), an offline hindsight framework supporting sample-efficient bottleneck option discovery over continuous state-action spaces. Once bottleneck options are learnt offline over source domains, they are transferred online to improve exploration and value estimation on the transfer domain. Our experiments show that on complex long-horizon continuous control tasks with sparse, delayed rewards, MO2's properties are essential and lead to performance exceeding recent option-learning methods. Additional ablations further demonstrate the impact on option predictability and credit assignment.