In modern neural networks such as Transformers, linear layers require significant memory to store activations during the backward pass. This study proposes a memory-reduction approach for backpropagation through linear layers. Since the gradients of linear layers are computed by matrix multiplications, we consider methods for randomized matrix multiplication and demonstrate that they require less memory at the cost of a moderate decrease in test accuracy. We also investigate the variance of the gradient estimate induced by the randomized matrix multiplication and compare it with the variance arising from gradient estimation over a batch of samples. We demonstrate the benefits of the proposed method by fine-tuning a pre-trained RoBERTa model on GLUE tasks.
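To make the idea concrete, below is a minimal NumPy sketch of one standard randomized matrix multiplication scheme (column-row sampling with importance weights) applied to the weight gradient of a linear layer, dW = X^T dY. The function name `randomized_weight_grad` and the specific sampling distribution are illustrative assumptions, not necessarily the exact estimator used in the paper; the memory saving comes from only needing the sampled rows of the activation matrix X.

```python
import numpy as np

def randomized_weight_grad(X, dY, k, rng=None):
    """Unbiased column-row sampling estimate of dW = X.T @ dY.

    X  : (n, d_in)  activations stored for the linear layer
    dY : (n, d_out) gradient w.r.t. the layer output
    k  : number of sampled batch rows; k << n reduces the memory
         needed for activations, at the cost of extra variance
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    # Importance sampling probabilities proportional to row-norm products,
    # a common variance-reducing choice for randomized matmul.
    p = np.linalg.norm(X, axis=1) * np.linalg.norm(dY, axis=1)
    p = p / p.sum() if p.sum() > 0 else np.full(n, 1.0 / n)
    idx = rng.choice(n, size=k, replace=True, p=p)
    # Rescale each sampled term by 1 / (k * p_i) so the estimator is
    # unbiased: E[dW_hat] = sum_i p_i * X_i^T dY_i / p_i = X^T dY.
    scale = 1.0 / (k * p[idx])
    return (X[idx] * scale[:, None]).T @ dY[idx]
```

In this sketch the full X is passed in for clarity; in an actual memory-saving implementation the sampling indices would be drawn during the forward pass so that only the k selected rows of X are kept for the backward pass.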