Data mining and knowledge discovery rely on large amounts of textual data, which must be handled efficiently. Specific problems, such as finding the maximum weight ordered common subset of a set of ordered sets or searching for specific patterns within texts, occur frequently in this context. In this paper we present several novel and practical algorithmic techniques for processing textual data (strings) in order to solve multiple problems efficiently. Our techniques build on efficient string algorithms and data structures, such as KMP, suffix arrays, tries, and deterministic finite automata.