Beyond Sliding Windows: Learning to Manage Memory in Non-Markovian Environments

Recent success in developing increasingly general-purpose agents based on sequence models has led to increased focus on the problem of deploying computationally limited agents in the vastly more complex real world. A key challenge in these more realistic domains is the presence of highly non-Markovian dependencies in the agent's observations, which are less common in small controlled domains. The predominant approach in the literature is to stack a window of the most recent observations (Frame Stacking), but the window size must grow with the degree of non-Markovian dependency, which results in prohibitive computational and memory requirements for both action inference and learning. In this paper, we are motivated by the insight that in many environments that are highly non-Markovian with respect to time, the environment causally depends on only a relatively small number of observations over that time scale. A natural direction is therefore to consider meta-algorithms that maintain relatively small adaptive stacks of memories, so that highly non-Markovian dependencies with respect to time can be expressed while fewer observations are considered at each step, yielding substantial savings in both compute and memory. Hence, we propose a meta-algorithm (Adaptive Stacking) that achieves exactly this with convergence guarantees, and we quantify the reduced computation and memory requirements for MLP-, LSTM-, and Transformer-based agents. Our experiments use popular memory tasks, which give us control over the degree of non-Markovian dependency. This allows us to demonstrate that an appropriate meta-algorithm can learn to remove memories that are not predictive of future rewards without excessively removing important experiences. Code: https://github.com/geraudnt/adaptive-stacking
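To make the contrast between Frame Stacking and an adaptive stack concrete, the following is a minimal sketch and not the authors' implementation. It assumes that an adaptive stack holds at most k observations and that some policy chooses which stored entry to evict when a new observation arrives. The class names, the `keep_cues` rule (a hand-written stand-in for a learned policy), and the toy observation stream are all hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Callable, List, Sequence


class FrameStack:
    """Fixed window: always keeps the k most recent observations."""

    def __init__(self, k: int):
        self.buffer = deque(maxlen=k)

    def push(self, obs) -> List:
        self.buffer.append(obs)  # the oldest entry is dropped automatically
        return list(self.buffer)


class AdaptiveStack:
    """Bounded memory in which a policy picks the stored entry to evict.

    Assumption: eviction is an index into the current buffer. In a learned
    setting this choice would come from the agent, not a fixed rule.
    """

    def __init__(self, k: int, evict_policy: Callable[[Sequence, object], int]):
        self.k = k
        self.evict_policy = evict_policy
        self.buffer: List = []

    def push(self, obs) -> List:
        if len(self.buffer) >= self.k:
            idx = self.evict_policy(self.buffer, obs)
            del self.buffer[idx]
        self.buffer.append(obs)
        return list(self.buffer)


def keep_cues(buffer, obs):
    """Hypothetical rule: evict the oldest non-cue entry, else the oldest."""
    for i, item in enumerate(buffer):
        if not str(item).startswith("cue"):
            return i
    return 0


# Toy stream: one informative cue, then uninformative steps, then a decision.
observations = ["cue:left"] + [f"corridor{i}" for i in range(6)] + ["junction"]

frame = FrameStack(k=3)
adaptive = AdaptiveStack(k=3, evict_policy=keep_cues)
for o in observations:
    frame_view = frame.push(o)
    adaptive_view = adaptive.push(o)

print(frame_view)     # the cue has fallen out of the fixed window
print(adaptive_view)  # the cue is still held, with the same budget k=3
```

With the same budget of three slots, the fixed window has lost the cue by the time the decision point arrives, whereas the adaptive stack still holds it. A fixed window would need k to cover the full delay between cue and decision to achieve the same.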
