Considerable progress has recently been made in leveraging CLIP (Contrastive Language-Image Pre-Training) models for text-guided image manipulation. However, all existing works rely on additional generative models to ensure the quality of results, because CLIP alone cannot provide enough guidance information for fine-scale pixel-level changes. In this paper, we introduce CLIPVG, a text-guided image manipulation framework using differentiable vector graphics, which is also the first CLIP-based general image manipulation framework that does not require any additional generative models. We demonstrate that CLIPVG can not only achieve state-of-the-art performance in both semantic correctness and synthesis quality, but is also flexible enough to support various applications far beyond the capability of all existing methods.