Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations from large-scale image-text pairs. It shows impressive performance on downstream tasks via zero-shot knowledge transfer. To further enhance CLIP's adaptation capability, existing methods propose to fine-tune additional learnable modules, which significantly improves few-shot performance but introduces extra training time and computational cost. In this paper, we propose a training-free adaptation method for CLIP to conduct few-shot classification, termed Tip-Adapter, which not only inherits the training-free advantage of zero-shot CLIP but also performs comparably to training-required approaches. Tip-Adapter constructs the adapter via a key-value cache model from the few-shot training set, and updates the prior knowledge encoded in CLIP by feature retrieval. On top of that, the performance of Tip-Adapter can be further boosted to state-of-the-art on ImageNet by fine-tuning the cache model for 10$\times$ fewer epochs than existing methods, which is both effective and efficient. We conduct extensive few-shot classification experiments on 11 datasets to demonstrate the superiority of our proposed method. Code is released at https://github.com/gaopengcuhk/Tip-Adapter.
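To illustrate the key-value cache idea described above, here is a minimal NumPy sketch, not the official implementation: few-shot image features serve as keys, their one-hot labels as values, and a test feature retrieves cache logits that are blended with CLIP's zero-shot prediction. The function name `tip_adapter_logits`, the shapes, and the default `alpha`/`beta` values are illustrative assumptions.

```python
# Hypothetical sketch of a Tip-Adapter-style key-value cache for an
# N-way K-shot training set; names and hyperparameters are assumptions.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def tip_adapter_logits(test_feat, cache_keys, cache_values, clip_text_weights,
                       alpha=1.0, beta=5.5):
    """test_feat:          (B, D)  L2-normalized CLIP image features
    cache_keys:            (NK, D) L2-normalized features of the few-shot images
    cache_values:          (NK, C) one-hot labels of the few-shot images
    clip_text_weights:     (C, D)  L2-normalized CLIP text classifier features
    """
    # Zero-shot CLIP prediction: image-text cosine similarity.
    clip_logits = test_feat @ clip_text_weights.T
    # Cache retrieval: affinities between test features and cached keys,
    # mapped through exp(-beta * (1 - affinity)), then a weighted label lookup.
    affinity = test_feat @ cache_keys.T
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_values
    # Blend the training-free cache prediction with CLIP's prior knowledge.
    return clip_logits + alpha * cache_logits

# Toy usage: 3 classes, 2 shots each, 512-d features.
rng = np.random.default_rng(0)
keys = l2_normalize(rng.normal(size=(6, 512)))
values = np.eye(3)[np.repeat(np.arange(3), 2)]
text_w = l2_normalize(rng.normal(size=(3, 512)))
query = l2_normalize(rng.normal(size=(4, 512)))
print(tip_adapter_logits(query, keys, values, text_w).shape)  # (4, 3)
```

In the training-free setting the cache keys stay frozen; the fine-tuned variant mentioned in the abstract would instead treat the keys as learnable parameters and update them for a small number of epochs.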