The CNIDR ISEARCH Text Searching System
Erik Scott, Scott Technologies, Inc.
Archie Warnock, A/WWW Enterprises
Features of the 1.13 Release
Isearch is a software system for searching through large amounts of
text. It lets a user quickly find which documents in a collection
contain certain words. Unlike older search systems, Isearch does not
rely on a list of keywords or an abstract; every word of every document
can be searched. This greatly improves the chances of discovering new
information in old collections.
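The idea of searching every word, rather than a hand-picked keyword
list, can be illustrated with a small inverted index. The sketch below
is an illustration only and is not Isearch's own code or API: the
function names (tokenize, build_index, search_any) and the three sample
documents are hypothetical, and real indexes are stored on disk rather
than in memory.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_any(index, *words):
    """Return the sorted ids of documents containing at least one word."""
    hits = set()
    for word in words:
        # .get() avoids adding empty entries for words that were never indexed.
        hits |= index.get(word.lower(), set())
    return sorted(hits)


# Hypothetical three-document collection, for illustration only.
documents = {
    "doc1": "A disposable syringe with a retractable tip.",
    "doc2": "Method for sterilizing a needle before reuse.",
    "doc3": "Protective glove with reinforced fingertips.",
}

index = build_index(documents)
print(search_any(index, "needle", "syringe"))  # prints ['doc1', 'doc2']
```

Because the index is built once, ahead of time, a query only looks up
the words it names and never has to read the documents that do not
match.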
Consider this real-world example: CNIDR uses Isearch
to index and search a collection of over 2000 AIDS-related patents
issued by the U.S. Patent and Trademark Office. This collection of XXX
megabytes of raw text can be searched in less than 1 second. A
researcher looking for patents containing either the word
"needle" or the word "syringe" can submit the query
and get results back about as fast as his desktop machine can display
them.
- Searches large collections using a Free-Text search: no reliance
on keywords, abstracts, or human-generated indexes.
- Handles very large collections: collections of over 1 gigabyte
(about 1,000 megabytes) can be handled on modest servers. Essentially
unlimited textbases can be searched with careful layout and planning.
- Very sophisticated result sorting: The documents most likely to be
useful are returned first. Ranking is based on statistical analysis of
word frequencies and is generalized for a wide variety of subjects and
user skill levels.
- Fast: documents are machine-indexed before searching, so
non-matching documents needn't be read in. Fast enough to make optical
media a reasonable solution, and extremely responsive with cheap SCSI
- Works well with OCR document storage and retrieval systems: no
need for people to classify documents, and the statistical ranking
method is forgiving of OCR errors. Potentially millions of pages can be
made searchable for little more than photocopy costs.
- Handles a wide range of document types: can handle text in formats
from raw ASCII dumps to richly formatted SGML. Convenient doctype
interface allows handling of entirely new and unusual formats in a
matter of hours. Good supply of free and commercial doctypes available
from third parties.
- Efficient use of disk resources: Indexes are relatively compact,
generally smaller than the original collection, and yet contain
references to every word in the textbase.
- Text maintenance commands: old documents can be deleted instantly
and new data can be added without having to re-index the entire
- Portable and Scalable: works well on Unix machines from Linux PCs
to Crays. Takes advantage of Very Large Memory (VLM) technology for
Digital AlphaServers. Support for Windows NT in 3Q96.
- Integrates smoothly with World Wide Web (WWW) and ANSI Z39.50
servers: Anyone can search an Isearch textbase using their favorite web
browser. When used with CNIDR's Isite package, Isearch can be used
through a Z39.50 session to interoperate with library automation
software. Isearch and Isite together form a three-tier client-server
architecture to allow essentially unlimited capacity growth.
- Easy to customize: The modular, object-oriented structure of
Isearch means that new features can be added independently of the
Isearch core. Third party extension is facilitated by using
well-defined Application Programming Interfaces (APIs) implemented in