Automating prior art search in patent research with artificial intelligence depends on large machine learning datasets being developed and made available. This work offers a comprehensive solution to the problem of building research infrastructure for this field, including datasets and tools for calculating search quality criteria. The paper introduces and defines the authors' concept of semantic clusters of patent documents, which determine the state of the art in a given subject. Prior art search is then framed as the task of identifying the elements of the semantic cluster of patent documents in the subject area specified by the document under consideration. A generator of user-configurable machine learning datasets, built on collections of U.S. and Russian patent documents, is described. The generator first creates a database of links to the documents in semantic clusters and then, based on user-defined parameters, produces a dataset of semantic clusters in JSON format. To evaluate machine learning outcomes, the paper proposes calculating search quality scores that account for the semantic clusters of the documents being searched. To automate this evaluation, it describes a utility developed by the authors for assessing the quality of prior art document search.
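As an illustration of how a JSON dataset of semantic clusters could feed a cluster-aware search quality score, the following is a minimal sketch. It does not reproduce the authors' generator or evaluation utility: the record layout (`query`, `cluster`), the document identifiers, and the choice of recall at k as the score are all assumptions made for this example.

```python
import json

# Hypothetical record layout; the abstract does not specify the JSON schema.
# Each record pairs a query document with the documents of its semantic cluster.
# All document identifiers below are made-up placeholders.
SAMPLE_DATASET = """
[
  {"query": "US-0000001", "cluster": ["US-0000002", "US-0000003", "RU-0000004"]},
  {"query": "RU-0000005", "cluster": ["RU-0000006"]}
]
"""


def recall_at_k(ranked, cluster, k):
    """Fraction of the cluster's documents found among the top-k ranked results."""
    members = set(cluster)
    if not members:
        return 0.0
    return len(set(ranked[:k]) & members) / len(members)


def mean_recall_at_k(dataset, results, k):
    """Average recall at k over all query documents in the dataset."""
    scores = [
        recall_at_k(results.get(record["query"], []), record["cluster"], k)
        for record in dataset
    ]
    return sum(scores) / len(scores) if scores else 0.0


dataset = json.loads(SAMPLE_DATASET)

# Ranked search output per query document (again placeholder identifiers).
results = {
    "US-0000001": ["US-0000003", "US-0000009", "US-0000002"],
    "RU-0000005": ["RU-0000007", "RU-0000006"],
}

print(f"mean recall@2 = {mean_recall_at_k(dataset, results, 2):.3f}")
print(f"mean recall@3 = {mean_recall_at_k(dataset, results, 3):.3f}")
```

Recall at k is used here only because it is a common retrieval measure that naturally treats every member of a cluster as relevant. The scores computed by the authors' utility may be defined differently.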