The number of biomedical documents is growing at a staggering rate. Unlocking the information trapped in these documents can enable researchers and practitioners to operate confidently in the information world. Biomedical named entity recognition (NER), the task of recognising biomedical names, is usually employed as the first step of the NLP pipeline. Standard NER models, based on sequence tagging, are good at recognising short entity mentions in the generic domain. However, several open challenges arise when applying these models to recognise biomedical names: 1) biomedical names may have complex inner structure (discontinuity and overlap) that cannot be recognised using standard sequence tagging; 2) training NER models usually requires a large amount of labelled data, which is difficult to obtain in the biomedical domain; and 3) commonly used language representation models are pre-trained on generic data, so a domain shift exists between these models and the target biomedical data. To deal with these challenges, we explore several research directions and make the following contributions: 1) we propose a transition-based NER model that can recognise discontinuous mentions; 2) we develop a cost-effective approach that nominates suitable pre-training data; and 3) we design several data augmentation methods for NER. Our contributions have clear practical implications, especially when new biomedical applications are needed. Our proposed data augmentation methods can help an NER model achieve decent performance while requiring only a small amount of labelled data. Our investigation into selecting pre-training data can improve the model by incorporating language representation models that are pre-trained on in-domain data. Finally, our proposed transition-based NER model can further improve performance by recognising discontinuous mentions.
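The first challenge can be made concrete with a small sketch. It is not the transition-based model proposed here; it only illustrates, under simplifying assumptions, why a plain BIO tagging scheme cannot represent a discontinuous mention. The example sentence and the helper names `spans_to_bio` and `bio_to_spans` are hypothetical and chosen for illustration only.

```python
def spans_to_bio(n_tokens, mentions):
    """Encode mentions (each a list of token indices) as untyped BIO tags."""
    tags = ["O"] * n_tokens
    for indices in mentions:
        for k, i in enumerate(sorted(indices)):
            tags[i] = "B" if k == 0 else "I"
    return tags


def bio_to_spans(tags):
    """Decode BIO tags into mentions; every decoded mention is a contiguous run."""
    mentions, current = [], []
    for i, tag in enumerate(tags):
        if tag == "B":
            if current:
                mentions.append(current)
            current = [i]
        elif tag == "I" and current:
            current.append(i)
        else:
            # "O", or an "I" with no open mention, closes the current run.
            if current:
                mentions.append(current)
            current = []
    if current:
        mentions.append(current)
    return mentions


# Hypothetical sentence: "left atrium ... dilated" forms one discontinuous mention.
tokens = ["left", "atrium", "is", "mildly", "dilated"]
contiguous = [[0, 1]]        # "left atrium"
discontinuous = [[0, 1, 4]]  # "left atrium dilated"

roundtrip_contiguous = bio_to_spans(spans_to_bio(len(tokens), contiguous))
roundtrip_discontinuous = bio_to_spans(spans_to_bio(len(tokens), discontinuous))

print(roundtrip_contiguous)     # the contiguous mention survives the round trip
print(roundtrip_discontinuous)  # the detached token "dilated" is lost
```

In this sketch the contiguous mention is recovered exactly, whereas the discontinuous one comes back without its detached token, because the tag sequence carries no information linking "dilated" to "left atrium".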