Minhash token filter needs better documentation #20757

rpedela · 2016-10-05T14:30:59Z

The Minhash Token Filter documentation only describes the interface for the token filter. That is fine for most token filters, but this one is more complicated.

1. It should list possible use cases, such as an alternative to the "more like this" query.
2. It should talk about the recommended number of shingles: 5.
3. It should give small but complete examples for 1 and 2.

In the Lucene issue, they discuss Jaccard and cosine similarities. Did that make it into the final patch? If so, should that be exposed as a setting?

clintongormley · 2016-10-07T18:53:54Z

@rpedela I know nothing about it. Fancy sending a PR with the details?

pkmital · 2017-10-03T05:24:23Z

Also struggling to use this! Any help would be appreciated.

romseygeek · 2018-03-14T14:10:05Z

cc @elastic/es-search-aggs

wayliew · 2018-10-24T06:00:55Z

mark it


Kukunin · 2019-04-16T23:10:03Z

For anyone else who might be as confused as I was, I want to leave the following hint.

I was surprised to see the bucket_count parameter for the min_hash filter, since the Wikipedia article on MinHash says nothing about it: https://en.wikipedia.org/wiki/MinHash

I found the clue in this LUCENE-6968 comment: https://issues.apache.org/jira/browse/LUCENE-6968?focusedCommentId=15263867&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15263867

After a bit more digging, the single hash and keeping the minimum set can be improved.

See:
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To fill an empty bucket, take the minimum from the next non-empty bucket on the right with rotation.
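The bucketing idea in that summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Lucene's implementation: the hash function (truncated MD5), the 32-bit hash space, the default bucket count, and the names `token_hash`, `bucketed_min_hash`, and `estimate_similarity` are all hypothetical choices made for the example. It also borrows values unchanged when filling empty buckets, which is a simplification of what the linked papers describe.

```python
import hashlib


def token_hash(token: str) -> int:
    """Map a token to a stable 32-bit integer (illustrative hash choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bucketed_min_hash(tokens, bucket_count=8, hash_bits=32):
    """Keep the minimum hash per bucket, then fill empty buckets by rotation."""
    bucket_width = (1 << hash_bits) // bucket_count
    buckets = [None] * bucket_count

    # One hash per token; each hash lands in exactly one bucket of the hash space.
    for token in tokens:
        h = token_hash(token)
        idx = min(h // bucket_width, bucket_count - 1)
        if buckets[idx] is None or h < buckets[idx]:
            buckets[idx] = h

    if all(b is None for b in buckets):
        return buckets  # no tokens, so there is nothing to borrow from

    # Fill each empty bucket from the next non-empty bucket to its right,
    # wrapping around to the start of the list (the "rotation").
    filled = list(buckets)
    for i in range(bucket_count):
        if filled[i] is None:
            j = (i + 1) % bucket_count
            while buckets[j] is None:
                j = (j + 1) % bucket_count
            filled[i] = buckets[j]
    return filled


def estimate_similarity(sig_a, sig_b):
    """Fraction of bucket positions where two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = ["the quick brown", "quick brown fox", "brown fox jumps"]
    doc_b = ["the quick brown", "quick brown fox", "brown fox sleeps"]
    sig_a = bucketed_min_hash(doc_a)
    sig_b = bucketed_min_hash(doc_b)
    print(estimate_similarity(sig_a, sig_b))
```

The fingerprint always has exactly `bucket_count` entries regardless of how many tokens the document has, which is why a single hash function is enough in this scheme.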

Vince-Smith · 2021-06-07T17:42:24Z

I'm curious if anyone can provide guidance on how to query these min_hash fields once they've been analyzed. I'm not finding it to be intuitive.

yaroslav-tykhonchuk · 2022-07-27T09:28:33Z

> I'm curious if anyone can provide guidance on how to query these min_hash fields once they've been analyzed. I'm not finding it to be intuitive.

Once a field has been analyzed with a min_hash analyzer, you can inspect its tokens using the _termvectors API.

Request the term vectors for a document ID and specify the field whose terms you want to see. The number of terms the field has depends on the number of hash functions, the length of the text, the number of shingles the text is split into, and other parameters.

If you then query this field with a more_like_this query, Elasticsearch will use these terms to find similar documents, which speeds up the process considerably.

Additionally, when using like with documents, either _source must be enabled, or the fields must be stored or have term_vector enabled. To speed up analysis, it can help to store term vectors at index time.
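As a rough sketch of the steps above, the following Python builds the JSON request bodies involved without sending them anywhere. The index name, field name, filter names, and analyzer name are hypothetical placeholders, the parameter values are illustrative rather than recommendations, and the typeless paths and mappings assume Elasticsearch 7 or later.

```python
import json

# Hypothetical placeholder names, not taken from the thread.
INDEX = "articles"
FIELD = "body"


def index_definition():
    """Settings and mappings: 5-word shingles fed into a min_hash filter."""
    return {
        "settings": {"analysis": {
            "filter": {
                "five_word_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,
                },
                "fingerprint_hashes": {
                    "type": "min_hash",
                    "hash_count": 1,
                    "bucket_count": 512,
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {"minhash_analyzer": {
                "tokenizer": "standard",
                "filter": ["five_word_shingles", "fingerprint_hashes"],
            }},
        }},
        # Storing term vectors at index time avoids re-analysis at query time.
        "mappings": {"properties": {FIELD: {
            "type": "text",
            "analyzer": "minhash_analyzer",
            "term_vector": "yes",
        }}},
    }


def termvectors_request(doc_id):
    """Request that lists the min_hash terms stored for one document."""
    return ("GET", f"/{INDEX}/_termvectors/{doc_id}", {"fields": [FIELD]})


def similar_documents_request(doc_id):
    """more_like_this search that uses an indexed document as the example."""
    body = {"query": {"more_like_this": {
        "fields": [FIELD],
        "like": [{"_index": INDEX, "_id": doc_id}],
        # Each hash token typically occurs once per document, so the
        # default minimum term frequency of 2 would filter them all out.
        "min_term_freq": 1,
        "min_doc_freq": 1,
    }}}
    return ("GET", f"/{INDEX}/_search", body)


if __name__ == "__main__":
    print("PUT", f"/{INDEX}")
    print(json.dumps(index_definition(), indent=2))
    for method, path, body in (termvectors_request("1"),
                               similar_documents_request("1")):
        print(method, path)
        print(json.dumps(body, indent=2))
```

The printed bodies can be pasted into any HTTP client or console. Depending on how many hash tokens a document produces, the max_query_terms setting of more_like_this may also need raising, since it limits how many terms are selected for the query.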

clintongormley added >docs General docs changes help wanted adoptme :Search Relevance/Analysis How text is split into tokens labels Oct 7, 2016

mayya-sharipova self-assigned this Nov 20, 2018

mayya-sharipova removed the help wanted adoptme label Nov 20, 2018

mayya-sharipova mentioned this issue Feb 25, 2019

More details about how to use min_hash token filter? #38998

Closed

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Mar 5, 2019

Add documentation for min_hash filter

ebf1979

Closes elastic#20757

mayya-sharipova mentioned this issue Mar 5, 2019

Add documentation for min_hash filter #39671

Merged

mayya-sharipova closed this as completed in #39671 Mar 7, 2019

mayya-sharipova added a commit that referenced this issue Mar 7, 2019

Add documentation for min_hash filter (#39671)

5b852fa

* Add documentation for min_hash filter Closes #20757

mayya-sharipova added a commit that referenced this issue Mar 7, 2019

Add documentation for min_hash filter (#39671)

54d41af

Closes #20757

mayya-sharipova added a commit that referenced this issue Mar 7, 2019

Add documentation for min_hash filter (#39671)

6ce74a9

Closes #20757

javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024
