Thousands of programmers worldwide are quietly tapping away on their keyboards, trying to build the next great search engine. Sergey Brin and Lawrence Page, the famous creators of Google, admit that "engineering a search engine is a challenging task." A search engine is used to find information on the Web. It crawls the Internet, indexes millions of pages of information and returns results when someone runs a search.
Get a Web Crawler
Acquire a Web crawler, which is the spider or bot that crawls around the Internet collecting pages from the Web. A spider visits Web pages, reads them and follows links to other pages. You can find an open-source crawler or build your own. If you build your own, start with a list of seed URLs for the crawler to visit first. A slow crawler is easy to build, but a high-performance crawler that can index millions of pages is more challenging.
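The visit, read and follow-links loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production crawler: the names crawl, fetch_page and LinkExtractor are hypothetical, the crawl is a simple breadth-first walk from the seed URLs, and it leaves out robots.txt handling, rate limiting and parallel downloads, all of which a real crawler would need.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Download one page and return its HTML as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_urls, fetch=fetch_page, max_pages=50):
    """Breadth-first crawl from the seed URLs; returns {url: html}."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid repeats
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Calling crawl(["https://example.com/"]) would start from that single seed. The fetch parameter is there so the download step can be swapped out, for example with a lookup into a dictionary of saved pages when trying the loop offline.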
Get as much bandwidth as you can afford. Your crawler uses this bandwidth constantly as it travels across the Web downloading pages.
Build an index. Everything your crawler finds goes into the search engine index. The index is like a giant book or catalog containing a copy of every Web page that the crawler finds. Anna Patterson from Stanford University recommends indexing only the data you need to serve your kind of search results. She also advises that you shouldn't try to index "the kitchen sink" but rather "get something presentable up."
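One common way to organize such an index, offered here as an assumption since the sources above do not prescribe a structure, is an inverted index: a mapping from each word to the pages that contain it. The sketch below uses hypothetical names (tokenize, build_index) and assumes the crawler's pages have already been reduced to plain text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to {url: number of times the word appears on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] = index[word].get(url, 0) + 1
    return index
```

Looking up a word then returns the matching pages directly, without rescanning every stored page. In the spirit of Patterson's advice, this version stores only word counts rather than positions or formatting, which is enough to get basic results up.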
Rank your results. Using a high-performance database and all the information your Web crawling has stored on your servers, process the pages in your index, which may number in the millions. The pages recorded in your index need to be ordered by how relevant they are to what your searchers are looking for.
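As a rough illustration of ordering pages by relevance, the sketch below scores each page with TF-IDF, a simple textbook weighting chosen here for brevity. It is not the method used by any engine mentioned in this article, and real engines combine many more signals, such as the links between pages. The function names and sample pages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, pages):
    """Return the URLs that match the query, highest TF-IDF score first."""
    term_counts = {url: Counter(tokenize(text)) for url, text in pages.items()}
    total_pages = len(pages)
    scores = {}
    for term in set(tokenize(query)):
        matching = [url for url, counts in term_counts.items() if counts[term] > 0]
        if not matching:
            continue
        # Words found on fewer pages carry more weight.
        idf = math.log(1 + total_pages / len(matching))
        for url in matching:
            counts = term_counts[url]
            # Share of the page's words that are this query word.
            tf = counts[term] / sum(counts.values())
            scores[url] = scores.get(url, 0.0) + tf * idf
    return sorted(scores, key=scores.get, reverse=True)


sample_pages = {
    "a": "cheap flights cheap hotels",
    "b": "cheap cars and cheap vans and more cars",
    "c": "gardening tips",
}
results = rank("cheap flights", sample_pages)
```

With these sample pages, page "a" ranks ahead of "b" because it matches both query words and they make up a larger share of its text, while "c" is left out because it matches neither word.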
Build an attractive website to return search results.
Launch and market your search engine. A free search engine should take users where they want to go quickly and elegantly, according to Laszlo Xalieri from Search Engine Watch. He says that to run a successful search engine, "your goal is to attract consumers and sell access to them to marketers."
- Search Engine Watch: The Future of Search; Laszlo Xalieri
- Association for Computing Machinery: Why Writing Your Own Search Engine is Hard; Anna Patterson
- Stanford University: The Anatomy of a Large-Scale Hypertextual Web Search Engine; Sergey Brin and Lawrence Page
- Heritrix Open Source Web Crawler
- Grub Open Source Web Crawler