UL2: Unifying Language Learning Paradigms

Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, Donald Metzler

From arXiv. Updated Q1 2023 with Flan-UL2 20B release! :)

Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives, two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto frontier by outperforming T5 and GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established NLP tasks based on supervised fine-tuning. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B.
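As a rough illustration of how a Mixture-of-Denoisers objective with mode switching might build training examples, here is a minimal Python sketch. The abstract gives no implementation details, so everything concrete here is an assumption made for illustration: the mode-token strings (`[R]`, `[S]`, `[X]`), the span lengths and corruption rates, the `<extra_id_k>` sentinel format, and all function names. It is not the authors' T5X implementation.

```python
import random

# Hypothetical denoiser settings (illustrative values, not the paper's exact configuration).
# A value of None means "sequential" denoising, i.e. a prefix-LM style split.
DENOISERS = {
    "[R]": {"span_len": 3, "corrupt_rate": 0.15},   # regular: short spans, low corruption
    "[X]": {"span_len": 12, "corrupt_rate": 0.5},   # extreme: long spans, high corruption
    "[S]": None,                                    # sequential: predict the suffix
}


def span_corrupt(tokens, span_len, corrupt_rate, rng):
    """Replace random spans with sentinels; targets list each sentinel and its span."""
    inputs, targets = [], []
    i, sentinel = 0, 0
    # Approximate per-position chance of starting a span so that roughly
    # corrupt_rate of all tokens end up masked.
    start_prob = corrupt_rate / span_len
    while i < len(tokens):
        if rng.random() < start_prob:
            mark = f"<extra_id_{sentinel}>"
            inputs.append(mark)
            targets.append(mark)
            targets.extend(tokens[i:i + span_len])
            i += span_len
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


def prefix_lm(tokens, rng):
    """Split the sequence once: the prefix is the input, the suffix is the target."""
    split = rng.randint(1, len(tokens) - 1)  # assumes at least two tokens
    return tokens[:split], tokens[split:]


def make_example(tokens, rng):
    """Sample one denoiser and prepend its mode token to the input."""
    mode = rng.choice(list(DENOISERS))
    cfg = DENOISERS[mode]
    if cfg is None:
        inputs, targets = prefix_lm(tokens, rng)
    else:
        inputs, targets = span_corrupt(tokens, cfg["span_len"], cfg["corrupt_rate"], rng)
    return [mode] + inputs, targets


if __name__ == "__main__":
    rng = random.Random(0)
    toy_tokens = [f"t{i}" for i in range(30)]
    for _ in range(3):
        print(make_example(toy_tokens, rng))
```

Under this sketch, mode switching at fine-tuning time would amount to prepending the mode token that best matches the downstream task to each input, so the model is conditioned on the corresponding pre-training scheme.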
