Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan

Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering. For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.
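The few-shot prompting described above, where a handful of task-specific examples are interleaved with the query, can be illustrated with a minimal sketch. This is not the authors' implementation: the `<image>` placeholder string, the `Example` class, and the `build_few_shot_prompt` helper are all hypothetical names introduced here, and the sketch only assembles the interleaved text and the ordered list of visual inputs that a visual language model might consume.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical marker for where a visual input sits in the text stream.
# The abstract does not specify any such token; this is an assumption.
IMAGE_PLACEHOLDER = "<image>"


@dataclass
class Example:
    image_id: str  # hypothetical identifier for an image or video
    text: str      # annotation, e.g. a caption or a question/answer pair


def build_few_shot_prompt(
    support: List[Example], query_image_id: str, query_prefix: str
) -> Tuple[str, List[str]]:
    """Interleave support examples with a final query.

    Returns the prompt text and the visual inputs in the order in which
    their placeholders appear in the text.
    """
    parts: List[str] = []
    visuals: List[str] = []
    for ex in support:
        # Each support example contributes one visual input and its text.
        parts.append(f"{IMAGE_PLACEHOLDER}{ex.text}")
        visuals.append(ex.image_id)
    # The query ends with an open prefix that the model would complete.
    parts.append(f"{IMAGE_PLACEHOLDER}{query_prefix}")
    visuals.append(query_image_id)
    return "\n".join(parts), visuals


if __name__ == "__main__":
    support = [
        Example("img_0", "Output: a cat on a sofa."),
        Example("img_1", "Output: a dog in a park."),
    ]
    prompt, visuals = build_few_shot_prompt(support, "img_q", "Output:")
    print(prompt)
    print(visuals)
```

Adapting to a different task in this sketch means changing only the support examples and the query prefix, with no parameter updates, which is the sense in which the abstract speaks of adapting "simply by prompting the model with task-specific examples".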
