Internal features from large-scale pre-trained diffusion models have recently been established as powerful semantic descriptors for a wide range of downstream tasks. Methods that use these features typically add noise to images before passing them through the model, because the models do not yield their most useful features when given images with little to no noise. We show that this noise critically degrades the usefulness of the features, and that the degradation cannot be remedied by ensembling over different random noises. We address this issue by introducing a lightweight, unsupervised fine-tuning method that enables diffusion backbones to provide high-quality, noise-free semantic features. We show that these features outperform previous diffusion features by a wide margin across a broad range of extraction setups and downstream tasks, surpassing even ensemble-based methods at a fraction of the cost.
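For context, here is a minimal sketch of the standard noisy feature-extraction pipeline the abstract refers to (DIFT-style), written against the Hugging Face diffusers library. The model id, timestep, and hooked block are illustrative assumptions, not the paper's exact setup, and the `encode_prompt` signature follows recent diffusers versions:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pre-trained diffusion backbone (model id is an assumption).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
unet, vae, scheduler = pipe.unet, pipe.vae, pipe.scheduler

features = {}
def hook(module, inputs, output):
    # Cache intermediate activations to serve as semantic descriptors.
    features["feat"] = output

# Which block yields the best features is task-dependent; up_blocks[1]
# is a common choice in the literature, used here as an assumption.
unet.up_blocks[1].register_forward_hook(hook)

@torch.no_grad()
def extract_features(image, t=261, prompt=""):
    """image: (1, 3, H, W) tensor in [-1, 1], float16, on cuda."""
    # Encode to latents, then inject noise at timestep t -- the step the
    # abstract identifies as harmful to feature quality.
    latents = vae.encode(image).latent_dist.mean * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timestep = torch.tensor([t], device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, timestep)
    prompt_embeds, _ = pipe.encode_prompt(prompt, "cuda", 1, False)
    unet(noisy_latents, timestep, encoder_hidden_states=prompt_embeds)
    return features["feat"]
```

The abstract's point is that the `add_noise` step above is unavoidable with a stock backbone yet harmful to feature quality; the proposed fine-tuning removes the need for it.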