We develop and test a novel unsupervised algorithm for word sense induction and disambiguation that uses topological data analysis. Typical approaches to the problem rely on clustering based on simple, low-level distance features in word embedding space. Our approach instead draws on concepts from topology, which provide a richer conceptualization of clusters for the word sense induction task. We apply a persistent homology barcode algorithm to the SemCor dataset and demonstrate that our approach gives low relative error on word sense induction. This result shows the promise of topological algorithms for natural language processing, and we advocate for future work in this area.
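To make the barcode idea concrete, the following is a minimal sketch of a 0-dimensional persistent homology barcode computed over a point cloud, with long bars read as candidate word senses. It is not the authors' implementation: the abstract does not specify the filtration, the homology dimensions, or how bars are mapped to senses, so the Vietoris-Rips-style construction, the function names `h0_barcode` and `count_senses`, the `min_persistence` threshold, and the toy 2-D points (stand-ins for contextual embeddings of one target word, not SemCor data) are all assumptions made for illustration.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Return the 0-dimensional persistence barcode of a point cloud.

    Every point is born as its own connected component at scale 0.
    A component dies at the length of the edge that merges it into
    another component (equivalent to single-linkage clustering).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # merge the two components
            bars.append((0.0, d))    # one component dies at scale d
    bars.append((0.0, math.inf))     # the last component never dies
    return bars


def count_senses(bars, min_persistence):
    """Count bars longer than min_persistence (a hypothetical sense estimate)."""
    return sum(1 for birth, death in bars if death - birth > min_persistence)


# Toy stand-in for embeddings of one ambiguous word: two tight groups.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
bars = h0_barcode(toy)
print(count_senses(bars, min_persistence=1.0))
```

In this sketch, short bars correspond to points that merge quickly within a tight group, while bars that persist past the threshold correspond to well-separated groups. Choosing the threshold is itself a modeling decision that the abstract does not address.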