The number of databases, as well as their size and complexity, is increasing. This creates a barrier to use, especially for non-experts, who must come to grips with the nature of the data, the way it is represented in the database, and the specific query languages or user interfaces through which the data are accessed. These difficulties worsen in research settings, where it is common to work with many different databases. One approach to improving this situation is to allow users to pose their queries in natural language. In this work we describe Polyglotter, a machine learning framework that supports the mapping of natural language searches to database queries in a general way. Importantly, it does not require manually annotated training data and can therefore be applied easily to multiple domains. The framework is polyglot in the sense that it supports multiple database engines accessed through a variety of query languages, including SQL and Cypher. Furthermore, Polyglotter supports multi-class queries. Our results indicate that the framework performs well on both synthetic and real databases, and it may provide opportunities for database maintainers to improve the accessibility of their resources.