Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings of up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and a RoBERTa-Base model on WikiText-103, respectively. We also provide the code and model checkpoint trajectory to reproduce the results and to facilitate research on reusing historical weights for faster convergence.
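To make the averaging step concrete, here is a minimal sketch in PyTorch of how the k latest end-of-epoch checkpoints could be combined into a single set of weights. The function name, the checkpoint file-naming scheme, and the choice to accumulate in fp32 are illustrative assumptions, not details taken from the released code.

```python
import collections
import torch

def average_latest_checkpoints(paths, k):
    """Average the parameters of the k most recently saved checkpoints.

    `paths` is a list of checkpoint files (saved state_dicts) ordered
    oldest to newest, e.g. one saved at the end of each epoch.
    """
    latest = paths[-k:]
    avg = collections.OrderedDict()
    for path in latest:
        state = torch.load(path, map_location="cpu")
        for name, tensor in state.items():
            if not torch.is_floating_point(tensor):
                # Integer buffers (e.g. BatchNorm's num_batches_tracked)
                # cannot be meaningfully averaged; keep the newest value.
                avg[name] = tensor
            elif name in avg:
                # Accumulate in fp32 for numerical stability.
                avg[name] += tensor.float() / len(latest)
            else:
                avg[name] = tensor.float() / len(latest)
    return avg

# Hypothetical usage: evaluate the averaged weights in place of the
# latest checkpoint alone.
# paths = [f"ckpt_epoch_{epoch:03d}.pt" for epoch in range(1, 31)]
# model.load_state_dict(average_latest_checkpoints(paths, k=5))
```

Averaging only the k most recent checkpoints, rather than the full trajectory, keeps the averaged weights close to the current optimization state while still smoothing out epoch-to-epoch noise.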