Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is reported to be unsatisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at \url{https://github.com/LightDXY/FT-CLIP}.
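The setup described above, fine-tuning the CLIP image encoder with a classification head, can be sketched as follows. This is a minimal illustration, not the authors' released code: the model id `openai/clip-vit-base-patch16`, the AdamW settings, and the single training step are illustrative assumptions, since the hyper-parameter choices are exactly what the paper studies.

```python
# Minimal sketch (assumed, not the paper's exact recipe): fine-tune the CLIP
# ViT-B/16 image encoder with a linear classification head for ImageNet-style
# classification. Hyper-parameter values below are placeholders.
import torch
from torch import nn
from transformers import CLIPVisionModel


class CLIPClassifier(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        # CLIP image encoder only; the text tower is not needed for fine-tuning.
        self.backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
        self.head = nn.Linear(self.backbone.config.hidden_size, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Pooled [CLS] representation from the CLIP vision transformer.
        pooled = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(pooled)


model = CLIPClassifier(num_classes=1000)
# Learning rate, weight decay, etc. are the hyper-parameters the paper refines;
# these values are placeholders for illustration.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

# One illustrative training step with random tensors standing in for ImageNet-1K.
images = torch.randn(2, 3, 224, 224)
labels = torch.randint(0, 1000, (2,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```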