The World Wide Web (WWW) has become a vast ocean of information that grows every day. Downloading even a fraction of this data is a challenging task, much like sailing across an ocean. To download a large portion of the WWW, the crawling process must be parallelized. In this paper we present the architecture of a dynamic parallel Web crawler, named "WEB-SAILOR," which offers a scalable approach based on the client-server model to speed up downloading on behalf of a Web search engine in a distributed, domain-set-specific environment. WEB-SAILOR eliminates overlap among the documents downloaded by multiple crawlers without incurring any communication overhead among the parallel "client" crawling processes.
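One way to picture how a client-server design can avoid duplicate downloads without any crawler-to-crawler messaging is the minimal sketch below. It is an illustration under stated assumptions, not the WEB-SAILOR implementation: the class `SeedServer`, its methods, the client names, and the idea of defining a domain set by host suffix are all hypothetical. The sketch assumes that a central server keeps the only registry of known URLs and routes each URL to the single client whose domain set covers it, so clients talk to the server only.

```python
from collections import deque
from urllib.parse import urlsplit


class SeedServer:
    """Hypothetical central server: owns the URL registry and per-client queues."""

    def __init__(self, domain_sets):
        # domain_sets maps a client id to the set of host suffixes it may crawl
        # (an assumed way of defining a "domain set").
        self.domain_sets = domain_sets
        self.queues = {client: deque() for client in domain_sets}
        self.seen = set()  # every URL ever accepted, shared across all clients

    def owner_of(self, url):
        """Return the client whose domain set covers this URL, or None."""
        host = urlsplit(url).hostname or ""
        for client, suffixes in self.domain_sets.items():
            if any(host == s or host.endswith("." + s) for s in suffixes):
                return client
        return None

    def submit(self, url):
        """Register a URL once and queue it for its owning client.

        Returns True if the URL was newly queued, False if it was a
        duplicate or no client covers its domain.
        """
        client = self.owner_of(url)
        if client is None or url in self.seen:
            return False
        self.seen.add(url)
        self.queues[client].append(url)
        return True

    def next_url(self, client):
        """Hand the next pending URL to a client, or None if its queue is empty."""
        queue = self.queues[client]
        return queue.popleft() if queue else None


if __name__ == "__main__":
    server = SeedServer({"crawler_a": {"edu"}, "crawler_b": {"com"}})
    for link in ["http://example.edu/", "http://example.com/", "http://example.edu/"]:
        print(link, server.submit(link))
```

In this sketch a client would report the links it extracts through `submit` and fetch work through `next_url`. Because each URL has exactly one owner and the registry lives on the server, two clients are never handed the same document, and no messages pass between clients.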