Humans can leverage prior experience and learn novel tasks from a handful of demonstrations. In contrast to offline meta-reinforcement learning, which aims to achieve quick adaptation through better algorithm design, we investigate the effect of architectural inductive bias on few-shot learning capability. We propose a Prompt-based Decision Transformer (Prompt-DT), which leverages the sequential modeling ability of the Transformer architecture and the prompt framework to achieve few-shot adaptation in offline RL. We design the trajectory prompt, which contains segments of the few-shot demonstrations and encodes task-specific information to guide policy generation. Our experiments on five MuJoCo control benchmarks show that Prompt-DT is a strong few-shot learner without any extra finetuning on unseen target tasks. Prompt-DT outperforms its variants and strong offline meta-RL baselines by a large margin with a trajectory prompt containing only a few timesteps. Prompt-DT is also robust to prompt length changes and can generalize to out-of-distribution (OOD) environments.
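To make the trajectory-prompt idea concrete, below is a minimal sketch of how a prompt could be prepended to the agent's recent history before feeding the sequence to a causal Transformer. The function name `build_prompt_dt_input`, the dictionary keys, and all shapes are illustrative assumptions, not the paper's actual implementation; only the (return-to-go, state, action) token structure and the prepend-the-prompt design follow the description above.

```python
import numpy as np

def build_prompt_dt_input(prompt_traj, recent_traj):
    """Prepend a trajectory prompt to the agent's recent history.

    Each trajectory is a dict of arrays with shape (T, ...). The
    concatenated (return-to-go, state, action) tokens are what a
    causal Transformer would consume, so task-specific information
    in the prompt conditions the generated actions.
    Names and keys here are hypothetical, not the paper's API.
    """
    return {
        key: np.concatenate([prompt_traj[key], recent_traj[key]], axis=0)
        for key in ("returns_to_go", "states", "actions")
    }

# Hypothetical shapes: a 5-step prompt, a 20-step history,
# 11-dim states and 3-dim actions (e.g., a MuJoCo-like task).
prompt = {
    "returns_to_go": np.zeros((5, 1)),
    "states": np.zeros((5, 11)),
    "actions": np.zeros((5, 3)),
}
history = {
    "returns_to_go": np.zeros((20, 1)),
    "states": np.zeros((20, 11)),
    "actions": np.zeros((20, 3)),
}
tokens = build_prompt_dt_input(prompt, history)
assert tokens["states"].shape == (25, 11)  # prompt steps + history steps
```

Because the prompt occupies only a few timesteps of the input sequence, adaptation to a new task reduces to swapping in that task's demonstration segments, with no gradient updates to the model.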