Effectively modeling text-rich, fresh content such as news articles at the document level is a challenging problem. To ensure a content-based model generalizes well to a broad range of applications, it is critical to have a training dataset that is large beyond the scale of human labeling while achieving the desired quality. In this work, we address these two challenges by proposing a novel approach to mine semantically relevant fresh documents, together with their topic labels, with little human supervision. Meanwhile, we design a multitask model called NewsEmbed that alternates between a contrastive learning objective and a multi-label classification objective to derive a universal document encoder. We show that the proposed approach can provide billions of high-quality organic training examples and can be naturally extended to a multilingual setting, where texts in different languages are encoded in the same semantic space. We experimentally demonstrate NewsEmbed's competitive performance across multiple natural language understanding tasks, both supervised and unsupervised.
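The abstract does not give implementation details for the alternating multitask scheme; the sketch below is a minimal PyTorch illustration of the idea under stated assumptions: a placeholder shared encoder stands in for the actual document encoder, the contrastive objective is approximated with in-batch InfoNCE over document pairs, and the multi-label topic objective uses a sigmoid head with binary cross-entropy. All module names, dimensions, and hyperparameters here are illustrative, not taken from the paper.

```python
# Minimal sketch of an alternating multitask training loop:
# a shared encoder trained with a contrastive objective on one step
# and a multi-label classification objective on the next.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DocEncoder(nn.Module):
    """Stand-in for the shared document encoder (hypothetical architecture)."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        # EmbeddingBag used as a cheap placeholder for a transformer encoder.
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        # L2-normalized embeddings so dot products act as cosine similarity.
        return F.normalize(self.proj(self.embed(token_ids)), dim=-1)

def contrastive_loss(anchor, positive, temperature=0.05):
    """In-batch InfoNCE: each anchor's positive is the matching row."""
    logits = anchor @ positive.t() / temperature
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

NUM_TOPICS = 1000  # assumed size of the topic label vocabulary
encoder = DocEncoder()
topic_head = nn.Linear(256, NUM_TOPICS)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(topic_head.parameters()), lr=1e-4
)

def train_step(step, batch):
    """Alternate tasks by step parity; both losses update the shared encoder."""
    opt.zero_grad()
    if step % 2 == 0:
        # Contrastive task: a pair of semantically related documents.
        a = encoder(batch["doc_a"])
        b = encoder(batch["doc_b"])
        loss = contrastive_loss(a, b)
    else:
        # Multi-label topic classification task (multi-hot float targets).
        logits = topic_head(encoder(batch["doc"]))
        loss = F.binary_cross_entropy_with_logits(logits, batch["topics"])
    loss.backward()
    opt.step()
    return loss.item()
```

As a usage note, one could drive `train_step` with two interleaved data loaders, one yielding mined document pairs for the contrastive steps and one yielding documents with multi-hot topic labels for the classification steps; the alternation keeps both signals shaping the same encoder weights.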