Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object. We find previously proposed learning objectives for GFlowNets, flow matching and detailed balance, which are analogous to temporal difference learning, to be prone to inefficient credit propagation across long action sequences. We thus propose a new learning objective for GFlowNets, trajectory balance, as a more efficient alternative to previously used objectives. We prove that any global minimizer of the trajectory balance objective can define a policy that samples exactly from the target distribution. In experiments on four distinct domains, we empirically demonstrate the benefits of the trajectory balance objective for GFlowNet convergence, diversity of generated samples, and robustness to long action sequences and large action spaces.
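The trajectory balance objective mentioned above penalizes, for each sampled trajectory, the squared log-ratio between the forward flow (the learned partition estimate times the forward policy probabilities) and the backward flow (the target reward times the backward policy probabilities). A minimal sketch of that per-trajectory loss, assuming the log-probabilities and log-reward are already computed elsewhere (the function name and signature here are illustrative, not from the paper's code):

```python
import math

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Squared trajectory balance residual for a single trajectory.

    log_Z      -- learned estimate of the log partition function
    log_pf     -- list of forward-policy log-probs for each action taken
    log_pb     -- list of backward-policy log-probs along the same trajectory
    log_reward -- log of the unnormalized target density R(x) at the terminal object
    """
    # Residual of log [ Z * prod P_F ] - log [ R(x) * prod P_B ];
    # at a global minimum this is zero for every trajectory.
    residual = log_Z + sum(log_pf) - log_reward - sum(log_pb)
    return residual ** 2
```

When the estimated partition function and forward/backward policies are mutually consistent with the reward, the residual vanishes, which is the condition under which the learned policy samples exactly from the target distribution.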