Understanding and Improving Visual Prompting: A Label-Mapping Perspective

We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (input perturbation patterns) into downstream data points. Yet it remains unclear why VP stays effective even under a ruleless label mapping (LM) between the source classes and the target classes. Motivated by this, we ask: How is LM interrelated with VP, and how can this relationship be exploited to improve VP's accuracy on target tasks? We examine the influence of LM on VP and provide an affirmative answer: a better-quality LM (assessed by mapping precision and explanation) consistently improves the effectiveness of VP. This contrasts with prior work, in which the LM factor was missing. To optimize LM, we propose a new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps the source labels to the target labels and progressively improves the target task accuracy of VP. Further, when using a contrastive language-image pretrained (CLIP) model, we propose to integrate an LM process to assist CLIP's text prompt selection and to improve the target task accuracy. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VP methods. In particular, when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets, respectively. In addition, our proposal for CLIP-based VP provides 13.7% and 7.1% accuracy improvements on Flowers102 and DTD, respectively. Our code is available at https://github.com/OPTML-Group/ILM-VP.
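To make the idea of re-mapping source labels to target labels concrete, here is a minimal sketch of one re-mapping step. The abstract does not say how the mapping is computed, so the greedy, frequency-based one-to-one assignment below is an assumption made for illustration, and the names `greedy_label_mapping` and `mapped_accuracy` are hypothetical. The sketch covers only the label re-mapping; training the prompt itself is omitted.

```python
from collections import Counter


def greedy_label_mapping(source_preds, target_labels, num_target_classes):
    """Assign each target class a distinct source class by greedy frequency matching.

    source_preds:  source-class index predicted by the frozen model for each
                   (prompted) target sample
    target_labels: ground-truth target class of each sample
    Returns a dict {target_class: source_class}. This is an illustrative
    assumption, not necessarily the rule used by ILM-VP.
    """
    # Count how often each (target class, predicted source class) pair occurs.
    counts = Counter(zip(target_labels, source_preds))
    mapping = {}
    used_sources = set()
    # Visit pairs from most to least frequent; ties are broken by index order.
    for (t, s), _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if t not in mapping and s not in used_sources:
            mapping[t] = s
            used_sources.add(s)
        if len(mapping) == num_target_classes:
            break
    return mapping


def mapped_accuracy(source_preds, target_labels, mapping):
    """Fraction of samples whose predicted source class equals the one mapped to their target class."""
    hits = sum(1 for s, t in zip(source_preds, target_labels) if mapping.get(t) == s)
    return hits / len(target_labels)


if __name__ == "__main__":
    # Toy data: 8 samples, 3 target classes, source predictions from a frozen model.
    preds = [5, 5, 7, 2, 2, 2, 7, 5]
    labels = [0, 0, 0, 1, 1, 1, 2, 2]
    mapping = greedy_label_mapping(preds, labels, num_target_classes=3)
    print(mapping, mapped_accuracy(preds, labels, mapping))
```

In an iterative scheme of the kind the abstract describes, a step like this would presumably be repeated as the prompt changes, since the frozen model's predictions on prompted inputs shift whenever the prompt is updated.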
