Find near-duplicate documents using minhashing implemented in Go.

mauidude/deduper

Deduper

An HTTP server that finds documents that are near-duplicates of, or similar to, a given document. It uses go-raft, so it can run as a cluster of nodes and provide high availability.

The explanations of the minhash and locality-sensitive hashing (LSH) algorithms used can be found here.

Installation

go get github.com/mauidude/deduper

Building

godep go build

Running

./deduper [data directory]

Options

  • -host The host the server will run on. Defaults to localhost.
  • -port The port the server will run on. Defaults to 8080.
  • -leader The host:port of the leader node, if running as a follower. If omitted, the node runs as the leader.
  • -debug Enables debug output. Defaults to false.

The following options need to be tuned by testing against your document sizes and overall corpus size. If you change these values, you will need to re-add all of your documents.

  • -bands The number of bands to use in the minhash algorithm. Defaults to 100.
  • -hashes The number of hashes to use in the minhash algorithm. Defaults to 2.
  • -shingles The shingle size to use on the text. Defaults to 2.

Testing

godep go test ./...

API

Adding a document

POST /documents/:id HTTP/1.1
[HTTP headers...]

[document body]

This will add the document to the index under the given id.

Writes can be sent to either the leader or a follower. Any write sent to a follower is proxied to the leader.

Finding similar documents

POST /documents/similar HTTP/1.1
[HTTP headers...]

[document body]

This request accepts an optional threshold argument in the query string. Only documents with a similarity greater than or equal to that value are returned. The value must be between 0 and 1 and defaults to 0.8.

This will return a JSON array of matching documents and their similarity. Similarity is a value between 0 and 1, where 1 means identical and 0 means no shared content.

[
    {
        "id": "mydocument.txt",
        "similarity": 0.934
    },
    {
        "id": "someotherdocument.txt",
        "similarity": 0.85
    }
]
