We propose a novel method for generating titles for unstructured text documents. We reframe the problem as a sequential question-answering task. A deep neural network is trained on document-title pairs that are decomposable, meaning that the vocabulary of the title is a subset of the vocabulary of the document body. To train the model we use a corpus of millions of publicly available document-title pairs: news articles and their headlines. We present the results of a randomized double-blind trial in which subjects did not know which titles were human-written and which were machine-generated. When trained on approximately 1.5 million news articles, the model generates headlines that humans judge to be as good as or better than the original human-written headlines in the majority of cases.
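The decomposability property can be illustrated with a minimal sketch that filters document-title pairs by checking whether every title token also appears in the body. The tokenizer (lowercased alphanumeric runs), the function names `tokenize` and `is_decomposable`, and the sample pairs are assumptions made for illustration; the abstract does not specify how vocabulary is extracted.

```python
import re


def tokenize(text):
    # Assumed tokenization: lowercase runs of letters and digits.
    # The source does not say how vocabulary is defined.
    return re.findall(r"[a-z0-9]+", text.lower())


def is_decomposable(title, body):
    # A pair is decomposable when the title vocabulary is a subset
    # of the body vocabulary.
    return set(tokenize(title)) <= set(tokenize(body))


# Hypothetical sample pairs, not taken from the corpus described above.
pairs = [
    ("Storm hits coast", "A powerful storm hits the coast on Monday."),
    ("Markets rally", "Stocks rose sharply as the market rallied."),
]

# Keep only the pairs that satisfy the subset property.
decomposable_pairs = [(t, b) for t, b in pairs if is_decomposable(t, b)]
```

Here the first pair is kept because "storm", "hits", and "coast" all occur in the body, while the second is dropped because neither "markets" nor "rally" appears verbatim. A real pipeline would likely need a more careful tokenizer, and possibly stemming, before applying such a filter.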