Commit

Address feedback

mayya-sharipova committed Mar 5, 2019
1 parent ebf1979 commit a273050
Showing 1 changed file with 49 additions and 43 deletions.

docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc
@@ -1,40 +1,6 @@
[[analysis-minhash-tokenfilter]]
=== MinHash Token Filter

The `min_hash` token filter hashes each token of the token stream and divides
the resulting hashes into buckets, keeping the lowest-valued hashes per
bucket. It then returns these hashes as tokens.
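For intuition, here is a minimal Python sketch of that bucketing scheme. It is not Lucene's actual implementation: the MD5-based hashing, the modulo bucket routing, and the rotation fill are simplified stand-ins, and the parameter names simply mirror the filter's settings.

[source,python]
--------------------------------------------------
import hashlib

def min_hash_tokens(tokens, bucket_count=512, hash_set_size=1, with_rotation=True):
    # Hash every token and route each hash to a bucket. (The real filter
    # partitions the hash range; modulo routing is a simplification.)
    buckets = [[] for _ in range(bucket_count)]
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        buckets[h % bucket_count].append(h)
    # Keep only the `hash_set_size` smallest hashes in each bucket.
    kept = [sorted(b)[:hash_set_size] for b in buckets]
    if with_rotation:
        # Fill an empty bucket from the first non-empty bucket to its
        # circular right, so every bucket contributes a value.
        for i in range(bucket_count):
            if not kept[i]:
                j = (i + 1) % bucket_count
                while j != i and not kept[j]:
                    j = (j + 1) % bucket_count
                kept[i] = list(kept[j])
    # The kept hashes are what the filter emits as output tokens.
    return [str(h) for bucket in kept for h in bucket]

print(min_hash_tokens("the quick brown fox".split(), bucket_count=4))
--------------------------------------------------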
@@ -68,11 +34,11 @@ minimal collision. 5-word shingles typically work well.

* choosing the right settings for `hash_count`, `bucket_count` and
`hash_set_size` needs some experimentation.
** to improve precision, you should increase `bucket_count` or
`hash_set_size`. Higher values of `bucket_count` or `hash_set_size`
will provide a stronger guarantee that different tokens are
indexed to different buckets.
** to improve recall, you should increase the `hash_count` parameter.
For example, setting `hash_count=2` will cause each token to be hashed
in two different ways, thus increasing the number of potential
@@ -86,6 +52,41 @@ Thus, each document's size will be increased by around 8Kb.
that it doesn't matter how many times a document contains a certain token,
only whether it contains it or not.

==== Theory
The MinHash token filter allows you to hash documents for similarity search.
Similarity search, or nearest neighbor search, is a complex problem.
A naive solution requires an exhaustive pairwise comparison between a query
document and every document in an index. This is a prohibitive operation
if the index is large. A number of approximate nearest neighbor search
solutions have been developed to make similarity search more practical and
computationally feasible. One of these solutions involves hashing of documents.

Documents are hashed in such a way that similar documents are more likely
to produce the same hash code and are put into the same hash bucket,
while dissimilar documents are more likely to be hashed into
different hash buckets. This type of hashing is known as
locality sensitive hashing (LSH).

Depending on what constitutes the similarity between documents,
various LSH functions https://arxiv.org/abs/1408.2927[have been proposed].
For https://en.wikipedia.org/wiki/Jaccard_index[Jaccard similarity], a popular
LSH function is https://en.wikipedia.org/wiki/MinHash[MinHash].
The general idea of how MinHash produces a signature for a document
is to apply a random permutation over the whole index vocabulary (a random
numbering of the vocabulary) and record the minimum value of this permutation
for the document (the minimum number among the vocabulary words present
in the document). The permutations are run several times;
combining the minimum values from all of them constitutes the
signature of the document.
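As an illustration only (the vocabulary, documents, and seed below are invented for the example), the permutation view can be sketched in a few lines of Python:

[source,python]
--------------------------------------------------
import random

def permutation_signature(doc_tokens, vocabulary, num_permutations=4, seed=0):
    # Randomly renumber the whole vocabulary once per permutation, then
    # record the smallest number assigned to any word in the document.
    rng = random.Random(seed)
    vocab = sorted(vocabulary)
    signature = []
    for _ in range(num_permutations):
        numbering = {word: n for n, word in enumerate(rng.sample(vocab, len(vocab)))}
        signature.append(min(numbering[t] for t in set(doc_tokens)))
    return signature

vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
print(permutation_signature(["the", "quick", "brown", "fox"], vocab))
print(permutation_signature(["the", "quick", "brown", "dog"], vocab))
--------------------------------------------------

For a single permutation, the probability that two documents record the same minimum equals their Jaccard similarity, which is why the fraction of matching signature positions estimates that similarity.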

In practice, instead of random permutations, a number of hash functions
are chosen. A hash function calculates a hash code for each of a
document's tokens and chooses the minimum hash code among them.
The minimum hash codes from all hash functions are combined
to form a signature for the document.
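A hedged sketch of this practical variant (seeded SHA-1 here is just a stand-in for the filter's real hash functions):

[source,python]
--------------------------------------------------
import hashlib

def hash_signature(tokens, hash_count=4):
    # One seeded hash function per signature position; each position keeps
    # the minimum hash code observed over the document's distinct tokens.
    signature = []
    for seed in range(hash_count):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in set(tokens)
        ))
    return signature

sig_a = hash_signature("a penny saved is a penny earned".split())
sig_b = hash_signature("a penny saved is a penny got".split())
# The fraction of agreeing positions estimates the Jaccard similarity.
print(sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a))
--------------------------------------------------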


==== Example of setting MinHash Token Filter in Elasticsearch
Here is an example of setting up a `min_hash` filter:

[source,js]
@@ -95,18 +96,18 @@ POST /index1
"settings": {
"analysis": {
"filter": {
"my_shingle_filter": {
"my_shingle_filter": { <1>
"type": "shingle",
"min_shingle_size": 5,
"max_shingle_size": 5,
"output_unigrams": false
},
"my_minhash_filter": {
"type": "min_hash",
"hash_count": 1,
"bucket_count": 512,
"hash_set_size": 1,
"with_rotation": true
"hash_count": 1, <2>
"bucket_count": 512, <3>
"hash_set_size": 1, <4>
"with_rotation": true <5>
}
},
"analyzer": {
@@ -123,11 +124,16 @@ POST /index1
"mappings": {
"properties": {
"text": {
"type": "text",
"fingerprint": "text",
"analyzer": "my_analyzer"
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
<1> setting a shingle filter with 5-word shingles
<2> setting the min_hash filter to hash tokens with a single hash function
<3> setting the min_hash filter to hash tokens into 512 buckets
<4> setting the min_hash filter to keep only the single smallest hash in each bucket
<5> setting the min_hash filter to fill empty buckets with the value of the first non-empty bucket to its circular right
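To inspect what the filter emits, one could run the analyzer through the `_analyze` API. The snippet below is a hypothetical smoke test, assuming the index above was created on a local cluster at `localhost:9200`:

[source,python]
--------------------------------------------------
import requests

resp = requests.get(
    "http://localhost:9200/index1/_analyze",
    json={
        "analyzer": "my_analyzer",
        "text": "the quick brown fox jumped over the lazy dog",
    },
)
for token in resp.json()["tokens"]:
    print(token["token"])  # the emitted min-hash values
--------------------------------------------------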
