This note suggests two ideas that may be useful in the construction of an infrastructure for distributed indexing, searching, and browsing systems. First, I make a clear distinction between servers, which provide storage for digital objects, and collections, which organize related documents. Second, I argue that multiple, independent indexing systems may each require access to the original documents.
These ideas do not lead directly to suggestions for standards, nor are they completely original. They do offer a different perspective on the problem and place different constraints on developing standards.
Distributed information retrieval is often based on a model where many independent servers index local document collections and a directory server (or servers) guides users towards the independent indexes. This model assumes that the documents stored at a particular location define a collection.
The indexing infrastructure should allow the creation of multiple, overlapping collections that each include documents from many different servers. Here I want to use the term collection to refer to a group of related documents that share a coordinated indexing strategy; by coordinated, I mean that the index should be constructed based on knowledge of the collection as a whole.
In traditional information retrieval, term weights for a document are assigned using collection-wide statistics, e.g. words occurring in only a few documents are weighted more heavily. This collection-wide information (term due to Viles and French [4]) greatly increases effectiveness and enables other useful services, like automatically constructing hierarchies with scatter/gather [2] or helping users re-formulate queries (content routing [3]).
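As a concrete illustration, one common textbook form of such a weight is tf-idf. It is given here only as an example of a collection-wide weighting, not necessarily the scheme used by the systems cited; the symbols are assumptions introduced for this note.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

Here tf_{t,d} is the number of times term t occurs in document d, N is the number of documents in the collection, and df_t is the number of those documents that contain t. Both N and df_t are collection-wide: they change whenever the membership of the collection changes.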
Applying traditional term-weighting strategies in a distributed system is hard because the definition of "collection-wide" can be difficult to pin down and, even when it is pinned down, collecting the information can be expensive.
Sheldon [3] proposes a distributed IR model with the important characteristic that a collection of documents is described by a content label, and the content label can itself be treated as a document and included in another collection. Content labels help users manage and explore very large information spaces, but the idea could be valuably extended by treating collections (and their labels) separately from servers. Thus, a collection could include particular documents from many servers. (HyPursuit [5] moves in this direction.)
Consider a simple example: Several newspapers provide servers with their articles. We could construct many collections, each with different term weightings -- business articles from each of the newspapers, articles with a San Jose dateline, or movie reviews. Different terms would be useful in each collection.
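The newspaper example can be made concrete with a small sketch. Everything in it is hypothetical: the server names, the article texts, and the choice of an inverse-document-frequency weight, log(N/df), as a stand-in for whatever weighting a real index would use. The point is only that a collection is a list of document references that cuts across servers, and that the same term receives a different weight in each collection.

```python
import math
from collections import Counter

# Hypothetical articles, keyed by (server, document id). All names are invented.
DOCUMENTS = {
    ("mercury", "a1"): "chip maker reports strong earnings",
    ("times", "b1"): "film studio earnings fall",
    ("mercury", "a2"): "film review a thriller set in san jose",
    ("times", "b2"): "film review a quiet drama",
}

# A collection is a list of document references, independent of any one server.
BUSINESS = [("mercury", "a1"), ("times", "b1")]
REVIEWS = [("mercury", "a2"), ("times", "b2")]


def idf_table(collection):
    """Return log(N / df) for every term, computed over one collection only."""
    n_docs = len(collection)
    doc_freq = Counter()
    for ref in collection:
        # Count each term once per document.
        doc_freq.update(set(DOCUMENTS[ref].split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


business_idf = idf_table(BUSINESS)
reviews_idf = idf_table(REVIEWS)

# "film" distinguishes business articles but appears in every review.
print(business_idf["film"], reviews_idf["film"])
```

In this toy data "film" is a useful term in the business collection and carries no weight among the movie reviews, even though both collections draw on the same two servers.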
Recent work in distributed indexing has focused mostly on efficient indexing -- minimizing load on servers and keeping indexes small. This is accomplished in part by indexing a surrogate for each document that includes only part of the text (in Harvest, the first 100 lines of text and the first line of later paragraphs).
There is a tension between efficient indexing and collection-based indexing; the best choice of indexing in general isn't necessarily the best for any specific case. An indexing surrogate may omit important terms that occur late in the document or misrepresent the frequency of particular terms.
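A minimal sketch of a surrogate rule of this kind (the first lines of the text plus the first line of each later paragraph) shows how a term can drop out. This is not Harvest's actual code: the function name, the head_lines parameter, and the assumption that paragraphs are separated by blank lines are all introduced here for illustration.

```python
def make_surrogate(text, head_lines=100):
    """Keep the first head_lines lines, then only the first line of each
    later paragraph. Paragraphs are assumed to be separated by blank lines."""
    lines = text.splitlines()
    kept = lines[:head_lines]
    for i in range(head_lines, len(lines)):
        line = lines[i]
        # A paragraph starts at a non-blank line that follows a blank line.
        if line.strip() and (i == 0 or not lines[i - 1].strip()):
            kept.append(line)
    return "\n".join(kept)


# Tiny document, with head_lines lowered so the effect is visible.
sample = (
    "intro one\n"
    "intro two\n"
    "\n"
    "later first line\n"
    "later second line mentions zebra\n"
    "\n"
    "final first"
)
surrogate = make_surrogate(sample, head_lines=2)
print(surrogate)
```

The word "zebra" occurs only in the second line of a later paragraph, so it never reaches the index built from this surrogate.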
We can address this tension, in part, by creating a more flexible infrastructure that allows multiple indexing schemes to access the full content of the documents they are indexing. Where a Harvest gatherer describes a single surrogate for a document, a more flexible gatherer would generate surrogates according to a particular index's specifications.
Ideally, the system should be flexible enough to allow very different indexing schemes, including indexes that include word proximity information, n-gram based approaches that don't focus on words per se, or knowledge-based or natural language processing approaches. One possibility is for indexes to send the gatherer a program for generating document surrogates. The gatherer could run the program and return the results to the index.
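The last possibility might look roughly like the sketch below. It is only an illustration under strong assumptions: the Gatherer class and both surrogate functions are hypothetical, and plain Python callables stand in for the submitted program. How such a program would be shipped to a gatherer and executed safely is outside the sketch.

```python
from collections import Counter


class Gatherer:
    """Holds full document text and runs index-supplied surrogate programs."""

    def __init__(self, documents):
        self.documents = dict(documents)

    def gather(self, surrogate_program):
        # Run the index's program over the full text of every document.
        return {
            doc_id: surrogate_program(text)
            for doc_id, text in self.documents.items()
        }


def word_positions(text):
    """Surrogate for a proximity index: each word mapped to its word offsets."""
    positions = {}
    for offset, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(offset)
    return positions


def char_trigrams(text):
    """Surrogate for an n-gram index: counts of character trigrams."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


gatherer = Gatherer({"doc1": "to be or not to be"})
proximity_surrogates = gatherer.gather(word_positions)
ngram_surrogates = gatherer.gather(char_trigrams)
print(proximity_surrogates["doc1"]["to"])
```

Each index receives a surrogate shaped to its own needs, while the gatherer reads the full text only locally and returns just the results.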