The rapid development of science and technology has been accompanied by an exponential growth in peer-reviewed scientific publications. At the same time, the review of each paper is a laborious process that must be carried out by subject matter experts. Thus, providing high-quality reviews of this growing number of papers is a significant challenge. In this work, we ask the question "Can we automate scientific reviewing?" and discuss the possibility of using state-of-the-art natural language processing (NLP) models to generate first-pass peer reviews for scientific papers. Arguably the most difficult part of this is defining what a "good" review is in the first place, so we first discuss possible evaluation measures for such reviews. We then collect a dataset of papers in the machine learning domain, annotate them with different aspects of content covered in each review, and train targeted summarization models that take in papers to generate reviews. Comprehensive experimental results show that system-generated reviews tend to touch upon more aspects of the paper than human-written reviews, but the generated text can suffer from lower constructiveness for all aspects except the explanation of the core ideas of the papers, which are largely factually correct. We finally summarize eight challenges in the pursuit of a good review generation system together with potential solutions, which, we hope, will inspire further research on this subject. We make all code and the dataset publicly available at https://github.com/neulab/ReviewAdvisor, along with a ReviewAdvisor demo system at http://review.nlpedia.ai/.
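To illustrate what a "targeted summarization" review generator might look like in practice, the sketch below runs a generic pretrained BART summarizer from the Hugging Face transformers library over a paper's text to produce a first-pass review draft. This is not the trained model described in this work: the checkpoint name, input handling, and generation settings are illustrative assumptions only.

```python
# Minimal sketch of a review-generation pipeline built on a generic
# pretrained summarizer. This is NOT the trained ReviewAdvisor model;
# "facebook/bart-large-cnn" and all generation settings are assumptions
# chosen for illustration only.
from transformers import BartTokenizer, BartForConditionalGeneration

MODEL_NAME = "facebook/bart-large-cnn"  # stand-in checkpoint
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_first_pass_review(paper_text: str, max_review_tokens: int = 400) -> str:
    """Generate a draft review from the plain text of a paper.

    The input is truncated to the encoder's maximum length; a real system
    would need to select or aggregate salient sections (abstract,
    introduction, results) before generation.
    """
    inputs = tokenizer(
        paper_text,
        truncation=True,
        max_length=1024,         # BART encoder limit
        return_tensors="pt",
    )
    review_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,
        max_length=max_review_tokens,
        no_repeat_ngram_size=3,  # reduce verbatim repetition in the draft
        early_stopping=True,
    )
    return tokenizer.decode(review_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    # "paper.txt" is a hypothetical plain-text dump of a paper.
    with open("paper.txt") as f:
        print(generate_first_pass_review(f.read()))
```

A system aiming at aspect coverage, as described above, would additionally condition generation on the annotated review aspects rather than summarizing the paper text alone.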