Sparsity is another important algorithmic advance, beyond neural architecture search, that can greatly improve efficiency.
The use of sparsity in models ... has a very high potential payoff in terms of computational efficiency, and we are only scratching the surface ...
An MoE (Mixture of Experts) layer embedded in a recurrent language model. In this case, the sparse gating function selects two experts to perform the computation, and their outputs are modulated by the output of the gating network.
The sparsely-gated MoE achieved greater than 1000x improvements in model capacity, with only minor losses in computational efficiency on modern GPU clusters.
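To make the gating concrete, here is a minimal sketch of a top-2 sparsely-gated MoE layer in the spirit of the description above. It is an illustration under assumptions, not the published architecture: the expert shapes, the `num_experts` and `k` values, and the single-token interface are all invented for clarity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SparseMoE:
    """Sketch of a sparsely-gated MoE layer: a gating network scores every
    expert, only the top-k (here k=2) actually run, and their outputs are
    mixed using the renormalized gate weights."""
    def __init__(self, d_model, num_experts, k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.gate = rng.standard_normal((d_model, num_experts)) * 0.02
        # Each expert is a tiny two-layer feed-forward net (illustrative sizes).
        self.w_in = rng.standard_normal((num_experts, d_model, 4 * d_model)) * 0.02
        self.w_out = rng.standard_normal((num_experts, 4 * d_model, d_model)) * 0.02

    def __call__(self, x):                  # x: (d_model,), one token for clarity
        scores = x @ self.gate              # gating scores over all experts
        top = np.argsort(scores)[-self.k:]  # indices of the k best experts
        weights = softmax(scores[top])      # renormalize over the chosen experts
        out = np.zeros_like(x)
        for w, e in zip(weights, top):      # only k experts do any compute
            hidden = np.maximum(x @ self.w_in[e], 0.0)  # ReLU feed-forward
            out += w * (hidden @ self.w_out[e])         # modulate by gate weight
        return out

moe = SparseMoE(d_model=16, num_experts=8, k=2)
y = moe(np.random.default_rng(1).standard_normal(16))
```

Because only two of the eight experts run for any given token, per-token compute stays close to that of a single dense block, while the total parameter count scales with the number of experts.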
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Mixture of Experts (MoE) ... a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost.
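A back-of-the-envelope sketch of that decoupling, using made-up dimensions (`d_model`, `d_ff`, and the expert counts below are assumptions, not the paper's configuration): with Switch-style top-1 routing, total parameters grow linearly in the number of experts, while the FLOPs spent per token stay fixed, since each token visits exactly one expert.

```python
# Illustrative arithmetic only; all dimensions are invented for this sketch.
d_model, d_ff = 1024, 4096
dense_params = 2 * d_model * d_ff              # weights of one feed-forward block
for num_experts in (1, 8, 64, 512):
    total_params = num_experts * dense_params  # capacity grows with experts...
    flops_per_token = 2 * dense_params         # ...but each token visits one expert
    print(f"{num_experts:4d} experts: {total_params / 1e6:8.1f}M params, "
          f"{flops_per_token / 1e6:.1f}M FLOPs/token")
```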
... It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference ...
A crucial observation is that there is an inherent tension between how few similarity scores one computes and the flow of information between different nodes (i.e., the ability of tokens to influence each other).
The full attention model can be viewed as a complete graph ... carefully designed sparse attention can be as expressive and flexible as the original full attention model. Along with theoretical guarantees, ... a very efficient implementation allows us to scale to much longer inputs.
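The graph view makes the trade-off easy to see in code. Below is a sketch of a BigBird-style sparse attention mask; the window size, global-token count, and random-link count are arbitrary choices for illustration. Local windows plus a few global and random connections compute only a small fraction of the full n^2 similarity scores, yet keep every pair of tokens within a short path of each other.

```python
import numpy as np

def bigbird_style_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Sketch of a BigBird-style sparse attention pattern: a sliding local
    window, a few global tokens that attend everywhere (and are attended to
    by everyone), and a few random links per token."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                                    # local window
        mask[i, rng.choice(n, n_random, replace=False)] = True   # random links
    mask[:n_global, :] = True   # global tokens attend to all tokens...
    mask[:, :n_global] = True   # ...and all tokens attend to them
    return mask

m = bigbird_style_mask(64)
print(f"similarity scores computed: {m.sum()} of {m.size} "
      f"({100 * m.sum() / m.size:.1f}%)")   # far fewer than full attention
```

Each True entry is one query-key score actually computed; full attention would fill the entire matrix.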