Transformer models typically calculate attention matrices using dot products, which are limited in their ability to capture nonlinear relationships between embedding vectors. We propose Neural Attention, a technique that replaces dot products with feed-forward networks, enabling a more expressive representation of relationships between tokens. This approach modifies only the attention matrix calculation while preserving the matrix dimensions, making it easily adaptable to existing transformer-based architectures. We provide a detailed mathematical justification for why Neural Attention increases representational capacity and conduct controlled experiments to validate this claim. In comparisons with Dot-Product Attention, NLP experiments on WikiText-103 show a reduction in perplexity of over 2 percent, and image classification experiments on CIFAR-10 and CIFAR-100 show accuracy improvements of more than 4 percentage points. While Neural Attention introduces higher computational demands, we develop techniques to mitigate these challenges, ensuring practical usability without sacrificing the increased expressivity it provides. This work establishes Neural Attention as an effective means of enhancing the predictive capabilities of transformer models across a variety of applications. The code for all experiments is available at https://github.com/awayfromzel/neural-attention-research.
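To make the core idea concrete, the sketch below shows one plausible single-head reading of the abstract: each attention score is produced by a small feed-forward network applied to a query-key pair rather than by a dot product, and the resulting attention matrix keeps its usual (sequence length × sequence length) shape. The class name, the use of concatenated query-key pairs as input, and the scoring network's width and activation are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeuralAttentionSketch(nn.Module):
    """Single-head attention where each query-key score comes from a small
    feed-forward network instead of a dot product. The attention matrix
    retains the standard (seq_len x seq_len) shape."""

    def __init__(self, embed_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # Assumed scoring network: maps a concatenated (query, key) pair
        # to a single scalar score. Width and depth are illustrative choices.
        self.score_net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        B, T, D = q.shape
        # Form all (query_i, key_j) pairs: (B, T, T, 2*D)
        pairs = torch.cat(
            [q.unsqueeze(2).expand(B, T, T, D),
             k.unsqueeze(1).expand(B, T, T, D)],
            dim=-1,
        )
        scores = self.score_net(pairs).squeeze(-1)  # (B, T, T)
        attn = F.softmax(scores, dim=-1)            # same shape as dot-product attention
        return attn @ v                             # (B, T, D)


if __name__ == "__main__":
    x = torch.randn(2, 8, 32)                       # batch 2, sequence length 8, embed dim 32
    out = NeuralAttentionSketch(embed_dim=32)(x)
    print(out.shape)                                # torch.Size([2, 8, 32])
```

Because only the score computation changes and the softmax-normalized matrix keeps its dimensions, this drop-in form is compatible with existing transformer blocks; the quadratic number of pairwise network evaluations is where the higher computational cost mentioned in the abstract arises.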