Fields Institute - Industrial Seminar Series


COMMERCIAL AND INDUSTRIAL MATHEMATICS
		April 18, 2024
Seminar Series
October 25, 1999

Topic: Document Resemblance and Related Issues
Speaker: Andrei Broder, Chief Technology Officer at AltaVista

ABSTRACT

People often claim that two web pages are "the same" or "roughly the same", even though classic distances on strings (Hamming, Levenshtein, etc.) might indicate that the two pages are far apart. To formalize these intuitive ideas we defined the mathematical concept of document resemblance. The resemblance can be estimated using a fixed size "sketch" for each document. For a large collection of documents (say 200 million) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed; in other words, it suffices to determine whether the resemblance is above a certain threshold. We show how this determination can be made using a "sample" of less than 50 bytes per document.

The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set intersection problem, and second, the relative size of intersections is evaluated by a process of random sampling that can be done independently for each document. The process of estimating the relative size of intersection of sets and the threshold test discussed above can be applied to arbitrary sets, and thus might be of independent interest.

The ideas for filtering near-duplicate documents discussed here have been successfully implemented and are in current use in the context of the AltaVista search engine. This talk is tilted towards "algorithm engineering" rather than "algorithm analysis" and very little mathematical background is required.

BIOGRAPHICAL SKETCH

Andrei Broder is chief technology officer of the AltaVista Search division in the AltaVista Company. Previously he was a senior member of the research staff at Compaq's Systems Research Center in Palo Alto, California. He graduated from Technion, Israel's Institute of Technology, and did his Ph.D. in Computer Science at Stanford University under Don Knuth. He has written and co-authored more than 60 scientific papers and numerous patents. His main research interests are the design, analysis, and implementation of probabilistic algorithms and supporting data structures, in particular in the context of web-scale applications.
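
ILLUSTRATIVE SKETCH

The two aspects named in the abstract (resemblance as a set intersection problem, and per-document random sampling to estimate the relative size of the intersection) can be illustrated with a small min-hash style example. This is only a sketch under stated assumptions, not the method used at AltaVista: the abstract does not say how documents are turned into sets or how samples are drawn, so the word-level shingles, the seeded BLAKE2 hashes standing in for random permutations, the sketch length k, the 0.9 threshold, and all function names below are hypothetical choices made for illustration.

```python
import hashlib


def shingles(text, w=3):
    """Turn a document into a set of contiguous w-word sequences.

    Assumption: word-level shingling; the abstract does not specify this.
    """
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Exact relative intersection size |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """64-bit hash of an item under a given seed (stand-in for a random permutation)."""
    data = f"{seed}:{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(items, k=128):
    """Fixed-size sketch: the minimum hash value under each of k seeded hashes.

    Computed from one set alone, so each document can be processed independently.
    Requires a non-empty set.
    """
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate(s1, s2):
    """Estimate resemblance as the fraction of sketch positions that agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


def is_near_duplicate(s1, s2, threshold=0.9):
    """Threshold test: is the estimated resemblance at or above the threshold?"""
    return estimate(s1, s2) >= threshold
```

A caller would compute `sketch(shingles(doc))` once per document, store only the sketch, and compare sketches pairwise with `estimate` or `is_near_duplicate`. The estimate is unbiased for the set resemblance only to the extent that the seeded hashes behave like independent random orderings, which is an assumption of this example.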