Effectively modeling text-rich, fresh content such as news articles at the document level is a challenging problem. To ensure a content-based model generalizes well to a broad range of applications, it is critical to have a training dataset that is large beyond the scale of human labeling while achieving the desired quality. In this work, we address these two challenges by proposing a novel approach to mine semantically relevant fresh documents, together with their topic labels, with little human supervision. Meanwhile, we design a multitask model called NewsEmbed that alternates between a contrastive learning objective and a multi-label classification objective to derive a universal document encoder. We show that the proposed approach can provide billions of high-quality organic training examples and can be naturally extended to a multilingual setting, where texts in different languages are encoded in the same semantic space. We experimentally demonstrate NewsEmbed's competitive performance across multiple natural language understanding tasks, both supervised and unsupervised.
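The abstract does not give implementation details for the alternating multitask scheme; the sketch below is a minimal PyTorch illustration of the idea under stated assumptions: a placeholder shared encoder stands in for the actual document encoder, the contrastive objective is approximated with in-batch InfoNCE over document pairs, and the multi-label topic objective uses a sigmoid head with binary cross-entropy. All module names, dimensions, and hyperparameters here are illustrative, not taken from the paper.

```python
# Minimal sketch of an alternating multitask training loop:
# a shared encoder trained with a contrastive objective on one step
# and a multi-label classification objective on the next.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DocEncoder(nn.Module):
    """Stand-in for the shared document encoder (hypothetical architecture)."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        # EmbeddingBag used as a cheap placeholder for a transformer encoder.
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        # L2-normalized embeddings so dot products act as cosine similarity.
        return F.normalize(self.proj(self.embed(token_ids)), dim=-1)

def contrastive_loss(anchor, positive, temperature=0.05):
    """In-batch InfoNCE: each anchor's positive is the matching row."""
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

NUM_TOPICS = 1000  # assumed size of the topic label vocabulary
encoder = DocEncoder()
topic_head = nn.Linear(256, NUM_TOPICS)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(topic_head.parameters()), lr=1e-4
)

def train_step(step, batch):
    """Alternate tasks by step parity; both losses update the shared encoder."""
    opt.zero_grad()
    if step % 2 == 0:
        # Contrastive task: a pair of semantically related documents.
        a = encoder(batch["doc_a"])
        b = encoder(batch["doc_b"])
        loss = contrastive_loss(a, b)
    else:
        # Multi-label topic classification task (multi-hot float targets).
        logits = topic_head(encoder(batch["doc"]))
        loss = F.binary_cross_entropy_with_logits(logits, batch["topics"])
    loss.backward()
    opt.step()
    return loss.item()
```

As a usage note, one could drive `train_step` with two interleaved data loaders, one yielding mined document pairs for the contrastive steps and one yielding documents with multi-hot topic labels for the classification steps; the alternation keeps both signals shaping the same encoder weights.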