A zero-shot RL agent is an agent that can solve any RL task in a given environment instantly, with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards "controllable" agents that can follow arbitrary instructions in an environment. Current RL agents can at best solve families of related tasks, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) [BBQ+18] or forward-backward (FB) representations [TO21], but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and systematically test the viability of zero-shot RL schemes on tasks from the Unsupervised RL benchmark [LYL+21]. To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse curiosity, transition models, low-rank transition matrix, contrastive learning, or diversity (APS) perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.
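To make the zero-shot step concrete, the following is a minimal sketch of how an SF-style agent might turn a reward signal into a task vector without further learning. It assumes one common formulation from the successor-features literature, in which rewards are regressed linearly onto the elementary state features and the agent then acts greedily with respect to the successor features weighted by the resulting vector. The function names, the ridge term, and the random placeholder arrays are all illustrative assumptions; this is not the exact procedure or code used in this work.

```python
import numpy as np

def infer_task_vector(phi, rewards, ridge=1e-6):
    """Fit rewards linearly onto features: r(s) ~ phi(s) @ z.

    phi:     (n, d) elementary features of n sampled states (assumed given)
    rewards: (n,)   rewards observed at those states
    ridge:   small regularizer for numerical stability (an assumption)
    """
    n, d = phi.shape
    cov = phi.T @ phi / n          # empirical feature covariance
    target = phi.T @ rewards / n   # empirical feature-reward correlation
    return np.linalg.solve(cov + ridge * np.eye(d), target)

def greedy_action(psi_sa, z):
    """Pick argmax_a psi(s, a) @ z for one state.

    psi_sa: (num_actions, d) successor features for each action.
    In practice these would come from a learned model conditioned on z;
    here they are just a placeholder array.
    """
    return int(np.argmax(psi_sa @ z))

# Synthetic check: rewards that are exactly linear in the features.
rng = np.random.default_rng(0)
phi = rng.normal(size=(1000, 4))
true_z = np.array([1.0, -2.0, 0.5, 0.0])
rewards = phi @ true_z

z = infer_task_vector(phi, rewards)
psi_sa = rng.normal(size=(3, 4))   # placeholder successor features
action = greedy_action(psi_sa, z)
```

The only task-specific computation here is a small linear solve, which is what makes the approach "zero-shot" in this sketch: no gradient steps or planning are run once the reward is revealed. When the reward is not well approximated by a linear function of the features, the fit degrades, which is consistent with the abstract's remark that SFs are sensitive to the choice of elementary state features.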