Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
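The key mechanism described above — specifying a task "purely via text interaction with the model", with no gradient updates — can be illustrated with a minimal prompt-construction sketch. The helper name, the `=>` separator, and the translation demonstrations are illustrative assumptions, not details taken from the paper:

```python
# Minimal sketch of few-shot in-context prompting: the task is conveyed
# entirely as text, as K demonstrations followed by a query, and the model
# is asked only to complete the final line (no fine-tuning involved).
# build_few_shot_prompt and the example pairs are hypothetical illustrations.

def build_few_shot_prompt(instruction, demonstrations, query):
    """Concatenate a task instruction, K input => output demonstrations,
    and a final query line for the model to complete."""
    lines = [instruction]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model's completion supplies the answer
    return "\n".join(lines)

demos = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
prompt = build_few_shot_prompt("Translate English to French:", demos, "mint")
print(prompt)
```

In the zero-shot setting the demonstration list would simply be empty, leaving only the instruction and the query; the paper's few-shot results vary K between roughly 10 and 100 depending on what fits in the model's context window.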