Add documentation for min_hash filter #39671

Merged
121 changes: 119 additions & 2 deletions docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc
@@ -1,7 +1,7 @@
[[analysis-minhash-tokenfilter]]
-=== Minhash Token Filter
+=== MinHash Token Filter

-A token filter of type `min_hash` hashes each token of the token stream and divides
+The `min_hash` token filter hashes each token of the token stream and divides
the resulting hashes into buckets, keeping the lowest-valued hashes per
bucket. It then returns these hashes as tokens.

@@ -20,3 +20,120 @@ The following are settings that can be set for a `min_hash` token filter.
bucket to its circular right. Only takes effect if `hash_set_size` is equal to one.
Defaults to `true` if `bucket_count` is greater than one, else `false`.
|=======================================================================

Some points to consider while setting up a `min_hash` filter:

* `min_hash` filter input tokens should typically be k-word shingles produced
from a <<analysis-shingle-tokenfilter,shingle token filter>>. You should
choose `k` large enough so that the probability of any given shingle
occurring in a document is low. At the same time, as
internally each shingle is hashed into a 128-bit hash, you should choose
`k` small enough so that all possible
different k-word shingles can be hashed to a 128-bit hash with
minimal collision. 5-word shingles typically work well.

Member:

Just for my own education, do we have any blogs or knowledge articles around this? Or is this advice taken from the Wikipedia article or other sources?

Contributor Author:

@cbuescher I took the advice on 5-word shingles from the MinHash filter source code in Lucene.

Member:

That's interesting, would you mind linking to that source?

Contributor Author:

@cbuescher Thanks for the suggestion. I opted not to include the link to this source, as I am afraid the link will become invalid as the source code changes.

Contributor:

In the original PR that adds min_hash, it looks like we were not sure about the 5-word suggestion, and instead encouraged 2-word shingles: #20206 (comment). It would be nice if there was a reference or set of experiments to help confirm a good default value... I didn't manage to find one in a quick search, but will keep a lookout. The right choice seems like it would depend on the use case as well (for example similarity search vs. duplicate detection).

Contributor Author:

@jtibshirani Thanks a lot for the review. I think the best thing for now is to remove the sentence "5-word shingles typically work well." completely, as there are conflicting suggestions about what shingle size works best. Once we have better sources (external or from our own experiments), we can add shingle size suggestions to the file. Is this fine with you?

Contributor:

This sounds like a good plan to me!


* choosing the right settings for `hash_count`, `bucket_count` and
`hash_set_size` needs some experimentation.
** to improve precision, you should increase `bucket_count` or
`hash_set_size`. Higher values of `bucket_count` or `hash_set_size`
will provide a higher guarantee that different tokens are
indexed to different buckets.
** to improve recall, you should increase the `hash_count` parameter.
For example, setting `hash_count=2` will make each token be hashed in
two different ways, thus increasing the number of potential
candidates for search.

Contributor:

Should this be hash_count?

* the default settings make the `min_hash` filter produce 512
`min_hash` tokens for each document, each 16 bytes in size.
Thus, each document's size will be increased by around 8Kb
(512 tokens × 16 bytes = 8192 bytes).

* the `min_hash` filter is used to hash for Jaccard similarity. This means
that it doesn't matter how many times a document contains a certain token,
only whether it contains it or not.
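
To make the Jaccard point concrete, here is a minimal Python sketch
(illustrative only, not part of Elasticsearch or Lucene): Jaccard similarity
is computed over token _sets_, so repeated tokens have no effect on the score.

[source,python]
--------------------------------------------------
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard similarity J(A, B) = len(A intersect B) / len(A union B),
    computed over token *sets*."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Repeating "apple" changes nothing: both documents reduce to the
# sets {"red", "apple"} and {"green", "apple"}, so J = 1/3.
print(jaccard_similarity(["red", "apple", "apple"], ["green", "apple"]))
--------------------------------------------------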

==== Theory
The MinHash token filter allows you to hash documents for similarity search.
Similarity search, or nearest neighbor search, is a complex problem.
A naive solution requires an exhaustive pairwise comparison between a query
document and every document in an index. This is a prohibitive operation
if the index is large. A number of approximate nearest neighbor search
solutions have been developed to make similarity search more practical and
computationally feasible. One of these solutions involves hashing of documents.

Documents are hashed in a way that similar documents are more likely
to produce the same hash code and are put into the same hash bucket,
while dissimilar documents are more likely to be hashed into
different hash buckets. This type of hashing is known as
locality sensitive hashing (LSH).

Depending on what constitutes the similarity between documents,
various LSH functions https://arxiv.org/abs/1408.2927[have been proposed].
For https://en.wikipedia.org/wiki/Jaccard_index[Jaccard similarity], a popular
LSH function is https://en.wikipedia.org/wiki/MinHash[MinHash].
The general idea of how MinHash produces a signature for a document
is to apply a random permutation over the whole index vocabulary (a random
numbering of the vocabulary), and to record the minimum value of this permutation
for the document (the minimum number for a vocabulary word that is present
in the document). The permutations are run several times;
combining the minimum values for all of them constitutes a
signature for the document.
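
For intuition, the property that makes this scheme work (a standard result
from the MinHash literature, not specific to Elasticsearch) is that for a
random permutation `h`, the probability that two documents A and B agree on
their minimum value equals their Jaccard similarity:

[source,latex]
--------------------------------------------------
P\left[\min_{t \in A} h(t) = \min_{t \in B} h(t)\right]
  = \frac{|A \cap B|}{|A \cup B|} = J(A, B)
--------------------------------------------------

Averaging this agreement over several independent permutations therefore
gives an unbiased estimate of `J(A, B)`.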

In practice, instead of random permutations, a number of hash functions
are chosen. A hash function calculates a hash code for each of a
document's tokens and chooses the minimum hash code among them.
The minimum hash codes from all hash functions are combined
to form a signature for the document.
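
The following is a minimal Python sketch of this hash-function construction.
It is illustrative only: the hash family below is a toy stand-in, and this is
not how Lucene's MinHash filter is actually implemented.

[source,python]
--------------------------------------------------
import hashlib

def _hash(token, seed):
    # Toy stand-in for a family of independent hash functions: mix a
    # per-function seed into a stable digest of the token.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def minhash_signature(tokens, hash_count=8):
    """One minimum hash code per hash function forms the signature."""
    unique = set(tokens)
    return [min(_hash(t, seed) for t in unique) for seed in range(hash_count)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates J(A, B)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# With only 8 hash values the estimate is coarse; real setups keep many
# more (the Elasticsearch defaults described above keep 512 per document).
sig_1 = minhash_signature(["the red apple", "red apple pie"])
sig_2 = minhash_signature(["the red apple", "red apple tart"])
print(estimate_jaccard(sig_1, sig_2))
--------------------------------------------------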


==== Example of setting up the MinHash Token Filter in Elasticsearch
Here is an example of setting up a `min_hash` filter:

[source,js]
--------------------------------------------------
POST /index1
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": { <1>
          "type": "shingle",
          "min_shingle_size": 5,
          "max_shingle_size": 5,
          "output_unigrams": false
        },
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1, <2>
          "bucket_count": 512, <3>
          "hash_set_size": 1, <4>
          "with_rotation": true <5>
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_shingle_filter",
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fingerprint": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
<1> setting a shingle filter with 5-word shingles
<2> setting the min_hash filter to use one hash
<3> setting the min_hash filter to hash tokens into 512 buckets
<4> setting the min_hash filter to keep only the single smallest hash in each bucket
<5> setting the min_hash filter to fill empty buckets with values from neighboring buckets
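
To inspect what the analyzer actually emits, you can feed text through the
`_analyze` API. Below is a sketch using Python and the `requests` library;
it assumes a local node on `localhost:9200` and the `index1` settings above.
Note that min_hash tokens are raw 16-byte hashes, so they usually print as
unreadable strings.

[source,python]
--------------------------------------------------
import requests

# Assumes an Elasticsearch node on localhost:9200 and that "index1"
# was created with the settings shown above.
resp = requests.post(
    "http://localhost:9200/index1/_analyze",
    json={
        "analyzer": "my_analyzer",
        "text": "the quick brown fox jumped over the lazy dog",
    },
)
resp.raise_for_status()

tokens = resp.json()["tokens"]
print(len(tokens), "min_hash tokens produced")  # bounded by bucket_count
for t in tokens[:5]:
    print(t["position"], repr(t["token"]))
--------------------------------------------------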