In this work, we propose a sparse transformer architecture that incorporates prior information about the underlying data distribution directly into the transformer structure of the neural network. The design is motivated by a particular optimal transport problem, the regularized Wasserstein proximal operator, which admits a closed-form solution that can be read as a special instance of the transformer architecture. Compared with classical flow-based models, the proposed approach improves the convexity of the associated optimization problem and promotes sparsity in the generated samples. Through theoretical analysis and numerical experiments, including applications to generative modeling and Bayesian inverse problems, we demonstrate that the sparse transformer achieves higher accuracy and faster convergence to the target distribution than classical neural-ODE-based methods.
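As a minimal sketch of the closed-form solution alluded to above, the regularized Wasserstein proximal operator of a potential V is commonly written via a kernel (Laplace-type) formula of the following shape; the exact constants, the regularization parameter beta, and the role of V here are assumptions based on the standard formulation in the literature, not details taken from this abstract:
\[
\mathrm{WProx}_{T,V}(x) \;=\;
\frac{\displaystyle\int_{\mathbb{R}^d} y \,
\exp\!\Big(-\tfrac{\beta}{2}\big(V(y) + \tfrac{\|x-y\|^2}{2T}\big)\Big)\, dy}
{\displaystyle\int_{\mathbb{R}^d}
\exp\!\Big(-\tfrac{\beta}{2}\big(V(y) + \tfrac{\|x-y\|^2}{2T}\big)\Big)\, dy}.
\]
Read as a softmax-weighted average of candidate points y with scores \(-\tfrac{\beta}{2}\big(V(y) + \tfrac{\|x-y\|^2}{2T}\big)\), this expression has the same structure as an attention layer, which is the sense in which the closed-form solution yields a transformer-type representation.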