Result Merging in Distributed Indexing

Jan Pedersen and Hinrich Schuetze, Palo Alto Research Center, Xerox Corporation

Distributed search systems should be able to return ranked result lists as well as the unranked result sets that are common in Boolean systems and Harvest/Glimpse. Ranking is difficult in a distributed environment because simply combining results across servers is known to be inferior to the ranking that would be achieved if the documents where centralized. This is because ranking strategies make use of collection statistics which, in the case of a meta-collection composed of a number of distributed collections, are not available to each individual server. Some systems try to circumvent the need for statistical weighting by ranking according to the number of term matches (e.g. WAIS). However, it is well-known in the information retrieval literature that unweighted searches perform worse than weighted searches [1]. To address this problem, we believe that two-way communication of collection statistics must be included in standards for effective distributed searching.

Most term weighting methods in information retrieval rely on the collection document frequency of a term, i.e. the number of documents in the full collection in which that term occurs. When a user creates a meta-collection on the fly by selecting a number of collection servers to search over, a term's document frequency for the meta-collection can be computed by summing the document frequencies from the individual collections. The protocol should therefore enable the client to obtain the document frequencies of the current query terms from the individual collections. However, unless that information is communicated back to each individual server, the individual servers will not have the information necessary to form optimal weights. Instead they must rely on local statistics, which produce inferior results to global statistics [2]. The protocol should therefore allow for the communication back to each server of global collection statistics to be used in weighting. Finally, in order for the scores from different servers to be comparable, the same scoring function has to be used by each server. We propose that the most common weighting functions be specified in the standard and that the protocol allow the client to select which weighting function to use on a particular search. The following tf.idf formula is an example of one of the common weighting function in information retrieval:

sum over all terms i shared by query and document:
log(N / N(i)) * square-root(tf(i,document)) * square-root(tf(i,query))

where N is the total number of documents in the (meta-)collection, N(i) is the document frequency of term i, tf(i,document) is the term frequency of term i in the document, tf(i,query) is the term frequency of term i in the query and both document vector and query vector are assumed to be cosine-normalized.

In summary, we argue that results from information retrieval research motivate an extension of the protocols between clients and servers to support effective multi-site ranked searches. The standard protocol should allow clients

to obtain characteristic statistics about each server's collection
to select from a standard set of scoring functions and
to send term statistics on the current meta-collection to a server (along with a particular query)

[1] Salton and Buckley. Term-weighting Approaches in Automatic Text Retrieval. IP&M Vol. 24, No. 5, pp. 513-523, 1988.

[2] J.P. Callan, Zhihong Lu and W. Bruce Croft. Searching Distributed Collections with Inference Networks. SIGIR 95, pp. 21-28, 1995.