Evolving Hierarchical Memory-Prediction Machines in Multi-Task Reinforcement Learning

A fundamental aspect of behaviour is the ability to encode salient features of experience in memory and to use these memories, in combination with current sensory information, to predict the best action for each situation such that long-term objectives are maximized. The world is highly dynamic, and behavioural agents must generalize across a variety of environments and objectives over time. This scenario can be modeled as a partially observable multi-task reinforcement learning problem. We use genetic programming to evolve highly generalized agents capable of operating in six unique environments from the control literature, including OpenAI's entire Classic Control suite. This requires the agent to support discrete and continuous actions simultaneously. No task-identification sensor inputs are provided, so agents must identify tasks from the dynamics of state variables alone and define control policies for each task. We show that emergent hierarchical structure in the evolving programs leads to multi-task agents that succeed by performing a temporal decomposition and encoding of the problem environments in memory. The resulting agents are competitive with task-specific agents in all six environments. Furthermore, the hierarchical structure of programs allows for dynamic run-time complexity, which results in relatively efficient operation.

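The requirement that a single agent support discrete and continuous actions simultaneously can be illustrated with a small sketch. This is not the paper's mechanism: it assumes, purely for illustration, that the agent emits one real-valued output per step and that a wrapper maps it onto whatever action space the current task exposes. The names `Discrete`, `Continuous`, and `to_env_action` are hypothetical, as are the example bounds.

```python
import math
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Discrete:
    """Hypothetical descriptor: actions are the integers 0 .. n-1."""
    n: int


@dataclass(frozen=True)
class Continuous:
    """Hypothetical descriptor: one real-valued action in [low, high]."""
    low: float
    high: float


ActionSpace = Union[Discrete, Continuous]


def to_env_action(raw: float, space: ActionSpace) -> Union[int, float]:
    """Map one raw real-valued agent output to a valid action for `space`.

    Assumption: the agent itself never sees `space`; only this wrapper does,
    so no task-identification input reaches the agent.
    """
    if isinstance(space, Continuous):
        # Clip the raw output into the allowed interval.
        return min(max(raw, space.low), space.high)
    # Discrete case: floor the output, then wrap the index into 0 .. n-1.
    return math.floor(raw) % space.n


if __name__ == "__main__":
    # Illustrative task table; the bounds are assumptions, not from the paper.
    tasks = {
        "two_action_task": Discrete(2),
        "three_action_task": Discrete(3),
        "torque_task": Continuous(-2.0, 2.0),
    }
    raw_output = 3.5  # the same agent output, interpreted per task
    for name, space in tasks.items():
        print(name, to_env_action(raw_output, space))
```

With this design the same output value yields a valid action in every task, so the agent's programs need no task-specific output heads. Other mappings (for example, a separate output per action type) would be equally plausible.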