Search Gateways and Indexing Engines for Web sites.

Indexing Engines for Web sites.

[This is a very old document. For a more recent review, see Comparing Open Source Indexers. Also see the JavaWorld article on Lucene.]

There are several tools available that make indexing a site relatively easy. It is no longer sensible to write your own search gateway; it makes a lot more sense to use somebody else's.

Probably the most widely used indexing software is Excite for Web Servers. It produces high quality results; the search interface can be customised; it is very easy to set up; and it is free.

Other indexing software in the public domain includes Harvest and its derivatives, and SWISH.

A partial list of indexing software follows.

Excite for Web Servers (EWS)
EWS is an application webmasters and web server administrators can download and install on their web servers. EWS provides intelligent, concept-based searching of the HTML and ASCII documents which are locally stored on their web server. You must be root (superuser) on your system to run Excite for Web Servers.

The Harvest Information Discovery and Access System
An integrated set of tools to gather, extract, organise, search, cache, and replicate relevant information across the Internet. With modest effort users can tailor Harvest to digest information in many different formats from many different machines, and offer custom search services on the web. Netscape's Catalog Server is based on the Harvest design.

Glimpse
A very powerful indexing and query system that allows you to search through all your files very quickly. It can be used by individuals for their personal file systems as well as by organizations for large data collections. Glimpse is the default search engine in Harvest.

WebGlimpse
WebGlimpse adds search capabilities to your WWW site automatically and easily. It attaches a small search box to the bottom of every HTML page, and allows the search to cover the neighborhood of that page or the whole site. With WebGlimpse there is no need to construct separate search pages, and no need to interrupt users while they browse. All pages remain unchanged except for the extra search capabilities. The search can even cover remote pages linked from your pages efficiently: WebGlimpse collects such remote pages to your disk and indexes them. Installation, customization (e.g., deciding which pages to collect and which ones to index), and maintenance are easy.

SWISH
SWISH stands for Simple Web Indexing System for Humans. With it, you can index directories of files and search the generated indexes. SWISH was created to meet the needs of the growing number of Web administrators on the Internet: many current indexing systems are not well documented, are hard to use and install, and are too complex for their own good. It is written in C.
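
To make the "index directories of files, then search the generated index" idea concrete, here is a minimal sketch in Python. It is not SWISH's code or file format, and it is not how any of the tools listed above are implemented. The function names (tokenize, build_index, search), the list of file suffixes, and the crude tag stripping are all assumptions made for illustration. It builds an in-memory inverted index (word to set of files) and answers queries by returning the files that contain every query word.

```python
import re
from collections import defaultdict
from pathlib import Path


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(root, suffixes=(".html", ".htm", ".txt")):
    """Map each word to the set of files under `root` that contain it.

    The suffix list is an illustrative assumption, not a SWISH default.
    """
    index = defaultdict(set)
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            # Crude tag stripping; a real indexer would parse the HTML.
            text = re.sub(r"<[^>]*>", " ", text)
            for word in tokenize(text):
                index[word].add(str(path))
    return index


def search(index, query):
    """Return the sorted list of files that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

For example, build_index("/path/to/docroot") followed by search(index, "harvest glimpse") would list the files under that (hypothetical) document root that mention both words. The real tools add the parts this sketch leaves out, such as an on-disk index, result ranking, and a CGI search gateway.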