CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it "see" the same way humans do, especially when interpreting artworks? In this paper, we investigate CLIP's ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP's responses to human annotations and expert benchmarks, we assess its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP's visual representations, particularly with respect to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism in generative processes such as style transfer and prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when they are applied to creative domains where nuance and subjectivity play a central role.