Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression in their training data. Drawing on both machine learning and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we extend a recent probabilistic extraction technique to measure memorization of 50 books in 17 open-weight LLMs. Through thousands of experiments, we show that the extent of memorization varies both by model and by book. With respect to our specific extraction methodology, we find that most LLMs do not memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B entirely memorizes some books, like the first Harry Potter book and 1984. In fact, the first Harry Potter book is so thoroughly memorized that, using a seed prompt consisting of just the first few tokens of the first chapter, we can deterministically generate the entire book near-verbatim. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
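To make the measurement concrete, the sketch below illustrates the core idea behind probabilistic extraction: scoring how likely a model is to emit a known book passage when conditioned on the text that precedes it. This is a minimal sketch, not the paper's exact pipeline; the model identifier (EleutherAI/pythia-1.4b) and the prefix/suffix pair are stand-ins chosen for illustration, assuming the Hugging Face transformers library.

```python
# Minimal sketch: score P(suffix | prefix) under an open-weight causal LM.
# Assumptions: Hugging Face transformers; EleutherAI/pythia-1.4b is a
# placeholder model, and the prefix/suffix pair is illustrative.
import math

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1.4b"  # any open-weight causal LM works here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def suffix_logprob(prefix: str, suffix: str) -> float:
    """Sum of per-token log-probabilities the model assigns to `suffix` given `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    suffix_ids = tok(suffix, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prefix_ids, suffix_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Each suffix token is scored under the distribution predicted from its left context.
    logprobs = F.log_softmax(logits[0, prefix_ids.shape[1] - 1 : -1], dim=-1)
    token_lp = logprobs.gather(1, suffix_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.sum().item()

# Opening line of 1984, split into a conditioning prefix and a target suffix.
lp = suffix_logprob("It was a bright cold day in April,", " and the clocks were striking thirteen.")
print(f"P(suffix | prefix) = {math.exp(lp):.3e}")
```

A passage counts as highly memorized when this probability is large; aggregating such scores over many excerpts of a book's text is one way to quantify how much of the book a model has memorized.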
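The claim about deterministic near-verbatim generation corresponds to greedy decoding, which always picks the most probable next token and therefore reproduces the same continuation from the same seed on every run. Below is a minimal sketch, again assuming the transformers library; the model identifier and seed text are illustrative, and running a 70B model requires multiple high-memory GPUs.

```python
# Minimal sketch: deterministic (greedy) generation from a short seed prompt.
# Assumptions: Hugging Face transformers; meta-llama/Llama-3.1-70B is the
# assumed identifier for the public (gated) base weights, and the seed is
# the opening words of the book's first chapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-70B"  # assumed identifier; gated weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

seed = "Mr. and Mrs. Dursley, of number four, Privet Drive"  # first tokens of chapter one
inputs = tokenizer(seed, return_tensors="pt").to(model.device)

# do_sample=False selects the argmax token at every step, so the output is
# deterministic: a memorized continuation is replayed identically each run.
out = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

One way to extend this beyond a single window is to iterate, feeding the tail of each generated chunk back in as the next prompt; under greedy decoding the whole process stays deterministic.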