This paper describes a machine learning and data science pipeline for structured information extraction from documents, implemented as a suite of open-source tools and extensions to existing tools. It centers on a methodology for extracting procedural information from published scientific literature in the form of recipes: stepwise procedures for creating an artifact, in this case synthesizing a nanomaterial. From the overall goal of producing recipes from free text, we derive the technical objectives of a system organized into pipeline stages: document acquisition and filtering, payload extraction, recipe step extraction as a relationship extraction task, recipe assembly, and presentation through an information retrieval interface with question answering (QA) functionality. The system meets computational information and knowledge management (CIKM) requirements for metadata-driven payload extraction, named entity extraction, and relationship extraction from text. Functional contributions described in this paper include semi-supervised machine learning methods for the PDF filtering and payload extraction tasks, followed by structured extraction and data transformation tasks that begin with section extraction, proceed to recipe steps represented as information tuples, and end with assembled recipes. Measurable objective criteria for extraction quality include precision and recall of recipe steps; ordering constraints; and QA accuracy, precision, and recall. Results, key novel contributions, and significant open problems derived from this work center on attributing these holistic quality measures to specific machine learning and inference stages of the pipeline, each with its own performance measures. The desired recipes contain identified preconditions, material inputs, and operations, and constitute the overall output of the CIKM system.
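To make the step-level quality criteria concrete, the following is a minimal sketch of how precision and recall of extracted recipe steps could be computed. It assumes each step is an information tuple with `operation`, `materials`, and `conditions` fields and that a predicted step counts as correct only on an exact match with a gold step. The field names, the exact-match criterion, and the toy synthesis data are illustrative assumptions, not the schema or scoring used in the paper.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple


class RecipeStep(NamedTuple):
    """Hypothetical step tuple; the field names are assumptions for illustration."""
    operation: str
    materials: FrozenSet[str]
    conditions: FrozenSet[str]


def step_precision_recall(
    predicted: Iterable[RecipeStep], gold: Iterable[RecipeStep]
) -> Tuple[float, float]:
    """Exact-match precision and recall over sets of recipe steps."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Made-up toy data: three gold steps, four predicted steps, two of them correct.
gold_steps = [
    RecipeStep("dissolve", frozenset({"HAuCl4", "water"}), frozenset()),
    RecipeStep("heat", frozenset({"solution"}), frozenset({"100 C"})),
    RecipeStep("add", frozenset({"sodium citrate"}), frozenset({"stirring"})),
]
predicted_steps = [
    gold_steps[0],
    gold_steps[1],
    RecipeStep("cool", frozenset({"solution"}), frozenset({"room temperature"})),
    RecipeStep("wash", frozenset({"ethanol"}), frozenset()),
]

precision, recall = step_precision_recall(predicted_steps, gold_steps)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is the strictest choice; a partial-credit variant that scores operations, materials, and conditions separately would make it easier to attribute errors to individual pipeline stages, which is the attribution problem the abstract highlights.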