This paper explores learning rich self-supervised entity representations from large amounts of associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks, such as ranked retrieval, knowledge base completion, and question answering. Unlike other methods that harvest self-supervision signals based merely on the local context within a sentence, we radically expand the notion of context to include any available text related to an entity. This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision. We present several training strategies that, unlike prior approaches, learn to jointly predict words and entities; we compare these strategies experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. Our models match or outperform competitive baselines, sometimes with little or no fine-tuning, and can scale to very large corpora. Finally, we make our datasets and pre-trained models publicly available. These include Reviews2Movielens (see https://goo.gle/research-docent), which maps a corpus of up to 1B words of Amazon movie reviews (He and McAuley, 2016) to MovieLens tags (Harper and Konstan, 2016), and Reddit Movie Suggestions (see https://urikz.github.io/docent), with natural language queries and corresponding community recommendations.
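The joint word-and-entity prediction objective mentioned above can be illustrated with a toy combined loss: cross-entropy on a masked word plus cross-entropy on the entity the text describes. This is a minimal hypothetical sketch, not the paper's actual training code; the function names, logits, and the weighting `alpha` are all assumptions for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def joint_loss(word_logits, word_target, entity_logits, entity_target, alpha=0.5):
    """Toy joint objective: a weighted sum of cross-entropy for a masked
    word and cross-entropy for the associated entity (alpha is a
    hypothetical mixing weight, not from the paper)."""
    word_ce = -math.log(softmax(word_logits)[word_target])
    entity_ce = -math.log(softmax(entity_logits)[entity_target])
    return alpha * word_ce + (1.0 - alpha) * entity_ce

# With uniform logits over 4 classes, each cross-entropy term is log(4),
# so the combined loss is log(4) regardless of alpha.
loss = joint_loss([0.0, 0.0, 0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 0)
```

In a full model both sets of logits would come from a shared text encoder, so gradients from the entity term shape the same representation used for word prediction.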