COMMERCIAL AND INDUSTRIAL MATHEMATICS

April 18, 2024

Seminar Series


October 25, 1999
Topic: Document Resemblance and Related Issues
Speaker: Andrei Broder, Chief Technology Officer at AltaVista

ABSTRACT:

People often claim that two web pages are "the same" or "roughly the same", even though classic distances on strings (Hamming, Levenshtein, etc.) might indicate that the two pages are far apart.

To formalize these intuitive ideas we defined the mathematical concept of document resemblance. The resemblance can be estimated using a fixed size "sketch" for each document. For a large collection of documents (say 200 million) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed; In other words, it suffices to determine whether the resemblance is above a certain threshold. We show how this determination can be made using a "sample" of less than 50 bytes per document.

The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest.

The ideas for filtering near-duplicate documents discussed here have been successfully implemented and are in current use in the context of the AltaVista search engine. This talk is tilted towards "algorithm engineering" rather than "algorithm analysis" and very little mathematical background is required.

BIOGRAPHICAL SKETCH

Andrei Broder is chief technology officer of the AltaVista Search division in the AltaVista Company. Previously he was a senior member of the research staff at Compaq's Systems Research Center in Palo Alto, California. He graduated from Technion, Israel's Institute of Technology, and did his Ph.D. in Computer Science at Stanford University under Don Knuth. He has written and co-authored more than 60 scientific papers and numerous patents. His main research interests are the design, analysis, and implementation of probabilistic algorithms and supporting data structures, in particular in the context of web-scale applications.