Artificial intelligence (AI) systems, and Large Language Models (LLMs) in particular, are increasingly employed for creative tasks such as scientific idea generation, constituting a form of generalization from training data that existing conceptual frameworks do not address. Although combinatorial creativity (CC) resembles compositional generalization (CG), it is an open-ended ability. Rather than evaluating for accuracy or correctness against fixed targets, which would contradict the open-ended nature of CC, we propose a theoretical framework and an algorithmic task that evaluate outputs by their degrees of novelty and utility. Building on this foundation, we make several important empirical contributions: (1) We obtain the first insights into the scaling behavior of creativity for LLMs. (2) We discover that, for fixed compute budgets, there exist optimal model depths and widths for creative ability. (3) We find that the ideation-execution gap, whereby LLMs excel at generating novel scientific ideas but struggle to ensure their practical feasibility, may be explained by a more fundamental novelty-utility tradeoff characteristic of creativity algorithms in general. Importantly, this tradeoff persists even at scale, casting doubt on the long-term creative potential of LLMs in their current form. Together, our conceptual framework and empirical findings provide a foundation for understanding and improving creativity in modern AI models, bridging the gap between human and machine intelligence.