An HTTP server that finds near-duplicate or similar documents given another document. It uses go-raft, so it can run as a cluster with other nodes and provide high availability.
Explanations of the minhash and locality-sensitive hashing algorithms used can be found here.
go get github.com/mauidude/deduper
godep go build
./deduper [data directory]
-host
The host the server will run on. Defaults to localhost.
-port
The port the server will run on. Defaults to 8080.
-leader
The host:port of the leader node, if running as a follower. Defaults to leader mode.
-debug
Enables debug output. Defaults to false.
The following options will require testing with your document sizes and overall corpus size. If you change these values, you will need to re-add all of your documents.
-bands
The number of bands to use in the minhash algorithm. Defaults to 100.
-hashes
The number of hashes to use in the minhash algorithm. Defaults to 2.
-shingles
The shingle size to use on the text. Defaults to 2.
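To build intuition for the -shingles option, here is a minimal sketch of shingling and of exact Jaccard similarity, the quantity that minhash estimates. It assumes word-level shingles split on whitespace; deduper's own tokenization may differ. The names `shingles` and `jaccard` are hypothetical and are not part of deduper.

```go
package main

import (
	"fmt"
	"strings"
)

// shingles returns the set of k-word shingles in text.
// Assumption: words are split on whitespace; deduper may tokenize differently.
func shingles(text string, k int) map[string]struct{} {
	set := make(map[string]struct{})
	if k < 1 {
		return set
	}
	words := strings.Fields(text)
	for i := 0; i+k <= len(words); i++ {
		set[strings.Join(words[i:i+k], " ")] = struct{}{}
	}
	return set
}

// jaccard returns |a ∩ b| / |a ∪ b| for two shingle sets.
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 1 // two empty sets are treated as identical in this sketch
	}
	shared := 0
	for s := range a {
		if _, ok := b[s]; ok {
			shared++
		}
	}
	return float64(shared) / float64(len(a)+len(b)-shared)
}

func main() {
	a := shingles("the quick brown fox", 2)
	b := shingles("the quick brown dog", 2)
	// The two texts share 2 of 4 distinct shingles.
	fmt.Println(jaccard(a, b)) // prints 0.5
}
```

A larger shingle size makes matches stricter, because a single changed word affects more shingles. This is one reason the option needs tuning against your own documents.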
godep go test ./...
POST /documents/:id HTTP/1.1
[HTTP headers...]
[document body]
This will add the document to the index under the given id.
Writes can be sent to either a leader or a follower. Any writes sent to a follower are proxied to the leader.
POST /documents/similar HTTP/1.1
[HTTP headers...]
[document body]
This POST takes an optional threshold argument in the query string which will return only documents with a similarity greater than or equal to that value. This value must be between 0 and 1. The default is 0.8.
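As an illustration of the threshold argument, the sketch below builds the request URL and rejects values outside the documented range before any request is sent. The helper name `similarURL` is hypothetical, and the address assumes the default -host and -port.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// similarURL builds the URL for POST /documents/similar with a threshold
// query parameter. It rejects thresholds outside [0, 1], including NaN.
func similarURL(base string, threshold float64) (string, error) {
	if !(threshold >= 0 && threshold <= 1) {
		return "", fmt.Errorf("threshold %v is outside [0, 1]", threshold)
	}
	q := url.Values{}
	q.Set("threshold", strconv.FormatFloat(threshold, 'f', -1, 64))
	return strings.TrimRight(base, "/") + "/documents/similar?" + q.Encode(), nil
}

func main() {
	u, err := similarURL("http://localhost:8080", 0.9)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(u) // prints http://localhost:8080/documents/similar?threshold=0.9
}
```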
This will return a JSON array of matching documents and their similarity. Similarity is a value between 0 and 1, where 1 is identical and 0 is no shared content.
[
  {
    "id": "mydocument.txt",
    "similarity": 0.934
  },
  {
    "id": "someotherdocument.txt",
    "similarity": 0.85
  }
]
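A response of this shape can be decoded in Go with encoding/json. The sketch below is one possible approach; the type `Match` and the function `parseMatches` are hypothetical names, and the field tags are taken from the example response above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Match mirrors one element of the example response.
type Match struct {
	ID         string  `json:"id"`
	Similarity float64 `json:"similarity"`
}

// parseMatches decodes the JSON array returned by POST /documents/similar.
func parseMatches(data []byte) ([]Match, error) {
	var matches []Match
	if err := json.Unmarshal(data, &matches); err != nil {
		return nil, err
	}
	return matches, nil
}

func main() {
	// Sample body copied from the example response.
	body := []byte(`[
		{"id": "mydocument.txt", "similarity": 0.934},
		{"id": "someotherdocument.txt", "similarity": 0.85}
	]`)
	matches, err := parseMatches(body)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	for _, m := range matches {
		fmt.Printf("%s %.3f\n", m.ID, m.Similarity)
	}
}
```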