Detection of near-duplicate documents

A sample implementation of Charikar's hash for identification of similar documents.

How to

The calculation of the hash is performend in the following way:

Document is splitted into tokens (words for example) or super-tokens (word tuples)
Each token is represented by its hash value; a traditonal hash function is used
Weights are associated with tokens
A vector V of integers is initialized to 0, length of the vector corresponds to the desired hash size in bits
In a cycle for all token's hash values (h), vector V is updated:
- i^th element is decreased by token's weight if the i^th bit of the hash h is 0, otherwise
- i^th element is increased by token's weight if the i^th bit of the hash h is 1
Finally, signs of elements of V corresponds to the bits of the final fingerprint

Charikar: Similarity Estimation Techniques from Rounding Algorithms, in Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, ACM Press, 2002
Manku, Jain, Sarma: Detecting Near-Duplicates for Web Crawling. in Proceedings of the 16th international conference on World Wide Web, ACM Press, 2007
Henzinger: Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms, in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, 2006