In spite of remarkable advances in Natural Language Processing, Entity Linking (EL) remains challenging in the humanities because of complex document typologies, a lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). This paper addresses these issues with two main contributions. The first is DELICATE, a novel neuro-symbolic method for EL on historical Italian that combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second is ENEIDE, a multi-domain EL corpus in historical Italian, semi-automatically extracted from two annotated editions that span the 19th to the 20th century and include literary and political texts. Results show that DELICATE outperforms other EL models on historical Italian, even when compared with larger architectures with billions of parameters. Moreover, further analyses reveal that DELICATE's confidence scores and feature sensitivity yield results that are more explainable and interpretable than those of purely neural methods.
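To make the neuro-symbolic idea concrete, the following is a minimal sketch of how neural candidate scores might be combined with symbolic checks for temporal plausibility and entity type consistency. It is an illustration under stated assumptions, not the DELICATE implementation: the `Candidate` fields, the function names, the `threshold` value, the rule that an entity must not postdate the document, and the placeholder identifiers are all hypothetical, and the encoder scores are assumed to be precomputed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A KB candidate for one mention (all fields are illustrative assumptions)."""
    kb_id: str                 # placeholder identifier, not a real Wikidata QID
    encoder_score: float       # similarity from a neural encoder, assumed precomputed
    entity_type: str           # coarse type such as "PER", "LOC", "ORG"
    start_year: Optional[int]  # birth or inception year, None if unknown


def is_temporally_plausible(cand: Candidate, doc_year: int) -> bool:
    # Assumed rule: an entity that came into existence after the document
    # was written cannot be its referent. Unknown dates are not ruled out.
    return cand.start_year is None or cand.start_year <= doc_year


def is_type_consistent(cand: Candidate, mention_type: str) -> bool:
    # Assumed rule: the KB entity type must equal the type predicted for the mention.
    return cand.entity_type == mention_type


def link(candidates: List[Candidate], mention_type: str, doc_year: int,
         threshold: float = 0.5) -> Optional[Candidate]:
    """Return the best-scoring candidate that passes both symbolic checks.

    Returns None (no link) when nothing survives the filters or when the
    best surviving score falls below the hypothetical threshold.
    """
    plausible = [
        c for c in candidates
        if is_temporally_plausible(c, doc_year) and is_type_consistent(c, mention_type)
    ]
    if not plausible:
        return None
    best = max(plausible, key=lambda c: c.encoder_score)
    return best if best.encoder_score >= threshold else None


# Toy example: a person mention in a document dated 1870.
candidates = [
    Candidate("KB_ship", 0.90, "ORG", 1960),    # top neural score, wrong type and too late
    Candidate("KB_person", 0.81, "PER", 1807),  # passes both checks
    Candidate("KB_later_person", 0.85, "PER", 1901),  # right type, but postdates the document
]
chosen = link(candidates, mention_type="PER", doc_year=1870)
print(chosen.kb_id if chosen else "NIL")
```

In this toy case the candidate with the highest neural score is rejected by the symbolic filters, which shows why such constraints can help with long-tail entities and why each decision can be traced back to an explicit rule.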