Aesthetic quality assessment is crucial for building a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: owing to expensive manual annotation, existing datasets focus overly on visual perception and neglect deeper aesthetic dimensions; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoders, while multimodal methods based on contrastive learning struggle to process long-form textual descriptions effectively. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset generated via an iterative pipeline that avoids heavy annotation costs and scales easily. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images that not only couples isolated aesthetic dimensions through joint description generation but also better models long-text semantics with the help of LLM decoders. Moreover, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and the generation paradigm (model) jointly minimize prediction entropy, providing a mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of the conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both the code and the dataset to support future research.