As the web grows in data volume and geographical distribution, centralized search methods become inefficient, which has increased interest in cooperative information retrieval such as federated text retrieval (FTR). Unlike centralized information retrieval (IR) methods, which search a logically centralized document collection, an FTR system is composed of a number of peers, each of which is a complete search engine in its own right. Processing a query in FTR requires two steps: first, identifying the promising peers that host relevant documents, and second, retrieving the most relevant documents from the selected peers. Most existing methods simply apply traditional IR techniques that treat each text collection as a single large document and rank the collections by term matching. In this paper, we formalize the problem, identify the properties of FTR, and analyze the feasibility of extending latent semantic indexing (LSI) with clustering to adapt it to FTR. Based on this analysis, we propose a novel approach called Cluster-based Distributed Latent Semantic Indexing (C-DLSI). C-DLSI distinguishes the topics of a peer through clustering, captures the local LSI spaces within the clusters, and considers the relations among these LSI spaces, thus providing a more precise characterization of the peer. Accordingly, we propose novel peer descriptors and a compatible local text retrieval method. Experimental results show that C-DLSI outperforms existing methods.
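The two-step query process described above, together with the term-matching baseline that treats each collection as a single large document, can be illustrated with a minimal sketch. This is not C-DLSI and not the authors' implementation: the function names (`select_peers`, `retrieve`), the raw term-count score, and the toy `PEERS` data are all assumptions chosen for illustration only.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def score(query_terms, term_counts):
    """Term-matching score: total occurrences of the query terms."""
    return sum(term_counts[t] for t in query_terms)


def select_peers(query, peers, k=1):
    """Step 1: rank peers by treating each collection as one large document."""
    q = tokenize(query)
    scored = []
    for name, docs in peers.items():
        big_doc = Counter()
        for d in docs:
            big_doc.update(tokenize(d))  # merge all documents of the peer
        scored.append((score(q, big_doc), name))
    scored.sort(key=lambda p: (-p[0], p[1]))
    # Keep only the top-k peers that match at least one query term.
    return [name for s, name in scored[:k] if s > 0]


def retrieve(query, peers, k_peers=1, k_docs=2):
    """Step 2: retrieve the best-matching documents from the selected peers."""
    q = tokenize(query)
    results = []
    for name in select_peers(query, peers, k_peers):
        for d in peers[name]:
            s = score(q, Counter(tokenize(d)))
            if s > 0:
                results.append((s, name, d))
    results.sort(key=lambda r: (-r[0], r[1], r[2]))
    return results[:k_docs]


# Hypothetical toy peers; each peer hosts its own small collection.
PEERS = {
    "peer_a": ["latent semantic indexing of text", "clustering text collections"],
    "peer_b": ["cooking pasta at home", "garden soil and plants"],
}
```

Calling `retrieve("semantic indexing", PEERS)` first selects `peer_a` and then returns its matching document. Because this baseline merges all of a peer's documents into one bag of terms, it cannot tell the topics within a peer apart, which is the limitation that clustering with per-cluster LSI spaces is meant to address.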