The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation. This trend is already evident in recent efforts to develop LLM-based methods and tools for the automatic generation of Competency Questions (CQs), the natural language questions used by ontology engineers to define the functional requirements of an ontology. However, the evaluation of these tools lacks standardization, which undermines methodological rigor and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible, API-based benchmarking system for KE automation. The current release focuses on evaluating tools that generate CQs automatically. Bench4KE provides a curated gold standard consisting of CQ datasets from 17 real-world ontology engineering projects and uses a suite of similarity metrics to assess the quality of the generated CQs. We present a comparative analysis of six recent LLM-based CQ generation systems, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing, and ontology drafting. Code and datasets are publicly available under the Apache 2.0 license.