The scaled dot-product attention (SDPA) mechanism is a core component of modern deep learning, but its mathematical form is often motivated by heuristics. This work provides a first-principles justification for SDPA. We first show that the attention forward pass is the exact solution to a degenerate, one-sided Entropic Optimal Transport (EOT) problem, which seeks a distribution that maximizes similarity while being maximally entropic. This optimization perspective has a direct consequence for the backward pass. We prove that the standard gradient computed via backpropagation is mathematically identical to an advantage-based policy gradient, a variance-reduced update rule from reinforcement learning. Crucially, we demonstrate that the EOT formulation of the forward pass induces a specific information geometry on the space of attention distributions. It is this geometry, characterized by the Fisher Information Matrix, that dictates the precise form of the learning gradient, revealing the advantage-based update as a natural consequence of the optimization problem being solved. This unified view reveals SDPA as a principled mechanism in which the forward pass performs optimal inference and the backward pass implements a rational, manifold-aware learning update.
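The two central identities in the abstract can be checked numerically. A minimal sketch, assuming scalar per-key "rewards" for simplicity (in full SDPA these would be the upstream gradient dotted with each value vector): the softmax forward pass solves the entropy-regularized inner-product maximization over the simplex, and the backpropagated gradient through the softmax equals the advantage-based policy-gradient form, with the softmax Jacobian coinciding with the Fisher Information Matrix of the categorical distribution in its logit parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
scores = rng.normal(size=n)     # attention logits s_i (e.g. q . k_i / sqrt(d))
rewards = rng.normal(size=n)    # assumed scalar per-key signals g_i for illustration

# Forward pass: p = softmax(s) is the unique maximizer of <p, s> + H(p)
# over the simplex -- the degenerate one-sided EOT problem described above.
p = np.exp(scores - scores.max())
p /= p.sum()

# Backward pass via the softmax Jacobian J = diag(p) - p p^T.
# Note: J is also the Fisher Information Matrix of the categorical
# distribution p in its logit parameterization.
J = np.diag(p) - np.outer(p, p)
grad_backprop = J @ rewards

# Advantage-based policy-gradient form: p_i * (g_i - baseline),
# where the baseline is the expected reward under p.
baseline = p @ rewards
grad_advantage = p * (rewards - baseline)

print(np.allclose(grad_backprop, grad_advantage))  # the two gradients coincide
```

The agreement follows algebraically: `J @ g = p * g - p * (p @ g) = p * (g - p @ g)`, which is exactly the variance-reduced update with the mean reward as baseline.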