Nutch –Based Search Engine Implementation and Chinese extension
Abstract
Search engine is the internet tool meeting demands of people while surfing on the internet and searching the information. It is a Internet Information navigation and bridge between internet user and information. However, with the sharply increase of the net content and the surprisingly change of the Synchronized forms of content, search engine can not satisfy increasingly critical user’s all kinds of search demands, although Web search is the foundation of the internet Roaming ,the existing number of search engine is down.This phenomenon can easily became one company almost monopolized all web search for its commercial gain. Therefore ,a strong and useful and effective search tool rise to the hope focus of internet user.Nutch is such search engine, when Nutch aims to text indexing, it uses the revenue Lucene toolkit for full-text indexing. Through the second Nutch development we can make use of its powerful internet resource Collection Function to collect the resource we need, then put the processed information into local database, finally, user can directly face effective information.
In this paper, we emphasize on the implementation architecture of the Nutch, Search engine principle,webpage crawling process. Excepting the in-depth research and analysis about above, we also give the solution of how to support Chinese and Chinese segmentation on the basis of earlier versions. Finally, a discussion about the application based on Nutch is given.
Key words:Search engine, crawler, Nutch, Chinese segmentation