The Quality-Diversity Transformer: Generating Behavior-Conditioned Trajectories with Decision Transformers

In the context of neuroevolution, Quality-Diversity algorithms have proven effective in generating repertoires of diverse and efficient policies by relying on the definition of a behavior space. A natural goal induced by the creation of such a repertoire is to achieve behaviors on demand, which can be done by running the corresponding policy from the repertoire. However, in uncertain environments, two problems arise. First, policies can lack robustness and repeatability, meaning that multiple episodes under slightly different conditions often result in very different behaviors. Second, due to the discrete nature of the repertoire, solutions vary discontinuously. Here we present a new approach to behavior-conditioned trajectory generation based on two mechanisms. The first is MAP-Elites Low-Spread (ME-LS), which constrains the selection of solutions to those that are the most consistent in the behavior space. The second is the Quality-Diversity Transformer (QDT), a Transformer-based model conditioned on continuous behavior descriptors, which trains on a dataset generated by policies from an ME-LS repertoire and learns to autoregressively generate sequences of actions that achieve target behaviors. Results show that ME-LS produces consistent and robust policies, and that its combination with the QDT yields a single policy capable of achieving diverse behaviors on demand with high accuracy.
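To make the idea of selecting for consistency in the behavior space concrete, here is a minimal sketch of a low-spread insertion rule for a MAP-Elites grid. The abstract does not specify how spread is measured or how candidates are compared, so everything here is an assumption made for illustration: spread is taken as the mean pairwise Euclidean distance between the behavior descriptors of repeated episodes, the cell is chosen from the mean descriptor, and a candidate replaces an incumbent only if it is at least as fit and at least as consistent. The names `spread`, `cell_of`, and `try_insert` are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean


def spread(descriptors):
    """Mean pairwise Euclidean distance between per-episode descriptors.

    Assumed definition of "spread"; lower means more repeatable behavior.
    """
    pairs = list(combinations(descriptors, 2))
    if not pairs:
        return 0.0
    return mean(math.dist(a, b) for a, b in pairs)


def cell_of(descriptor, bins, low=0.0, high=1.0):
    """Map a descriptor in [low, high]^d to a tuple of grid indices."""
    index = []
    for x in descriptor:
        i = int((x - low) / (high - low) * bins)
        index.append(min(max(i, 0), bins - 1))  # clamp to the grid
    return tuple(index)


def try_insert(repertoire, policy, descriptors, fitnesses, bins=10):
    """Insert a policy evaluated over several episodes.

    Assumed rule: an empty cell always accepts; otherwise the candidate
    must have mean fitness >= and spread <= those of the incumbent.
    """
    dims = len(descriptors[0])
    mean_descriptor = tuple(mean(d[k] for d in descriptors) for k in range(dims))
    cell = cell_of(mean_descriptor, bins)
    candidate = {
        "policy": policy,
        "fitness": mean(fitnesses),
        "spread": spread(descriptors),
    }
    incumbent = repertoire.get(cell)
    if incumbent is None or (
        candidate["fitness"] >= incumbent["fitness"]
        and candidate["spread"] <= incumbent["spread"]
    ):
        repertoire[cell] = candidate
        return True
    return False
```

Under this assumed rule, a policy with higher fitness but more scattered descriptors does not displace a more consistent incumbent, which is the property that makes the resulting repertoire a cleaner source of training data for a behavior-conditioned model.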
