Attention Scheme Inspired Softmax Regression

Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a probability distribution over possible next words or phrases, given a sequence of input words. The model then uses this distribution to select the most likely next word or phrase. The softmax unit also plays a crucial role in training LLMs, as it allows the model to learn from the data by adjusting the weights and biases of the neural network. In convex optimization, for example when using the central path method to solve linear programs, the softmax function has been used as a crucial tool for controlling the progress and stability of the potential function [Cohen, Lee and Song STOC 2019, Brand SODA 2020]. In this work, inspired by the softmax unit, we define a softmax regression problem. Formally, given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to use a greedy-type algorithm to solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2. \end{align*} In a certain sense, our provable convergence result provides theoretical support for why greedy algorithms can be used to train the softmax function in practice.
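To make the objective concrete, the following is a minimal sketch in plain Python that evaluates $\| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2$ and its gradient with respect to $x$. The function names (`softmax`, `matvec`, `loss`, `grad`, `descend`) are illustrative and do not come from the paper. The abstract only states that a greedy-type algorithm is used, so the gradient descent loop with step halving below is an assumed stand-in for illustration, not the authors' method.

```python
import math


def softmax(u):
    """Return exp(u) / <exp(u), 1_n>, shifted by max(u) for numerical stability."""
    m = max(u)
    e = [math.exp(ui - m) for ui in u]
    s = sum(e)
    return [ei / s for ei in e]


def matvec(A, x):
    """Compute the matrix-vector product Ax for A given as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def loss(A, x, b):
    """Softmax regression objective: || softmax(Ax) - b ||_2^2."""
    p = softmax(matvec(A, x))
    return sum((pi - bi) ** 2 for pi, bi in zip(p, b))


def grad(A, x, b):
    """Gradient of the objective with respect to x, via the chain rule."""
    p = softmax(matvec(A, x))
    r = [2.0 * (pi - bi) for pi, bi in zip(p, b)]      # dL/dp
    pr = sum(pi * ri for pi, ri in zip(p, r))          # <p, r>
    g_u = [pi * (ri - pr) for pi, ri in zip(p, r)]     # (diag(p) - p p^T) r
    return [sum(A[i][j] * g_u[i] for i in range(len(A)))
            for j in range(len(x))]                    # A^T g_u


def descend(A, b, x0, steps=200, eta=1.0):
    """Assumed stand-in optimizer: gradient descent with step halving.

    A step is accepted only if it strictly decreases the loss, so the
    returned loss never exceeds the loss at x0.
    """
    x = list(x0)
    cur = loss(A, x, b)
    for _ in range(steps):
        g = grad(A, x, b)
        step = eta
        while step > 1e-12:
            cand = [xj - step * gj for xj, gj in zip(x, g)]
            new = loss(A, cand, b)
            if new < cur:
                x, cur = cand, new
                break
            step *= 0.5
    return x, cur


if __name__ == "__main__":
    # Hypothetical toy instance with n = 3 and d = 2.
    A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    b = [0.2, 0.3, 0.5]
    x, final_loss = descend(A, b, [0.0, 0.0])
    print("x =", x, "loss =", final_loss)
```

Because softmax is invariant to adding a constant to every entry of $Ax$, subtracting the maximum before exponentiating leaves the objective unchanged while avoiding overflow.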
