Minhash token filter needs better documentation #20757

Closed
rpedela opened this issue Oct 5, 2016 · 7 comments · Fixed by #39671
Labels
>docs General docs changes :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@rpedela

rpedela commented Oct 5, 2016

The Minhash Token Filter documentation only describes the interface for the token filter. That is fine for most token filters, but this one is more complicated.

  1. It should list possible use cases such as an alternative to the "more like this" query.
  2. It should talk about the recommended number of shingles: 5.
  3. It should give small but complete examples for 1 and 2.
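A small setup along the lines of points 1 to 3 could look like the sketch below. It only builds the index-creation body as a Python dict and does not talk to a cluster. The filter, analyzer, and field names (`my_shingle_filter`, `my_minhash_filter`, `my_minhash_analyzer`, `fingerprint`) are hypothetical, the `min_hash` values are the filter's documented defaults written out explicitly, and the typeless `mappings` layout assumes Elasticsearch 7 or later.

```python
import json


def build_index_body(shingle_size=5):
    """Build an index-creation body that chains a shingle and a min_hash filter.

    All names below are illustrative; only the filter types and their
    parameter names come from Elasticsearch.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Fixed-size word shingles, no single words.
                    "my_shingle_filter": {
                        "type": "shingle",
                        "min_shingle_size": shingle_size,
                        "max_shingle_size": shingle_size,
                        "output_unigrams": False,
                    },
                    # MinHash over the shingles (defaults spelled out).
                    "my_minhash_filter": {
                        "type": "min_hash",
                        "hash_count": 1,
                        "bucket_count": 512,
                        "hash_set_size": 1,
                        "with_rotation": True,
                    },
                },
                "analyzer": {
                    "my_minhash_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["my_shingle_filter", "my_minhash_filter"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "fingerprint": {
                    "type": "text",
                    "analyzer": "my_minhash_analyzer",
                }
            }
        },
    }


if __name__ == "__main__":
    # The JSON printed here could be sent as the body of an index-creation request.
    print(json.dumps(build_index_body(), indent=2))
```

The order of the filters matters in this sketch: the shingle filter has to run first so that `min_hash` hashes shingles rather than single words.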

In the Lucene issue, they discuss Jaccard and cosine similarities. Did that make it into the final patch? If so, should that be exposed as a setting?

@clintongormley
Contributor

@rpedela I know nothing about it. Fancy sending a PR with the details?

@clintongormley clintongormley added >docs General docs changes help wanted adoptme :Search Relevance/Analysis How text is split into tokens labels Oct 7, 2016
@pkmital

pkmital commented Oct 3, 2017

Also struggling to use this! Any help would be appreciated.

@romseygeek
Contributor

cc @elastic/es-search-aggs

@wayliew

wayliew commented Oct 24, 2018

mark it

@mayya-sharipova mayya-sharipova self-assigned this Nov 20, 2018
@mayya-sharipova mayya-sharipova removed the help wanted adoptme label Nov 20, 2018
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Mar 5, 2019
mayya-sharipova added a commit that referenced this issue Mar 7, 2019
* Add documentation for min_hash filter

Closes #20757
@Kukunin

Kukunin commented Apr 16, 2019

For the next people who might be as confused as I was, I want to leave the following hint.

I was surprised to see the bucket_count parameter for the min_hash filter, since the Wikipedia article on MinHash says nothing about it: https://en.wikipedia.org/wiki/MinHash

I found the clue here: https://issues.apache.org/jira/browse/LUCENE-6968?focusedCommentId=15263867&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15263867

After a bit more digging, the single hash and keeping the minimum set can be improved.

See:
[1] http://jmlr.org/proceedings/papers/v32/shrivastava14.pdf
[2] http://www.auai.org/uai2014/proceedings/individuals/225.pdf

In summary: rather than keep the minimum set, split the hash space up into 500 buckets (for a 500 hash fingerprint) and keep the minimum for each bucket. To fill an empty bucket, take the minimum from the next non-empty bucket on the right with rotation.
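The bucketing scheme in that summary can be sketched in a few lines of Python. This is an illustration of the idea as described in the quoted comment, not Lucene's implementation: the hash function (the first 4 bytes of SHA-256) and the helper names `hash32`, `bucketed_minhash`, and `estimated_similarity` are assumptions made for the example, and the empty-bucket fill copies the neighbour's value directly without the offsetting that the linked papers discuss.

```python
import hashlib

HASH_SPACE = 1 << 32  # size of the 32-bit hash space


def hash32(token):
    """Deterministic 32-bit hash of a token (a stand-in, not Lucene's hash)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bucketed_minhash(tokens, bucket_count=500):
    """Keep the minimum hash per bucket, then fill empty buckets by rotation."""
    buckets = [None] * bucket_count
    for token in set(tokens):
        h = hash32(token)
        # Map the hash to the equal-width slice of the hash space it falls in.
        idx = h * bucket_count // HASH_SPACE
        if buckets[idx] is None or h < buckets[idx]:
            buckets[idx] = h

    if all(b is None for b in buckets):
        return buckets  # no tokens, so there is nothing to borrow from

    filled = list(buckets)
    for i in range(bucket_count):
        if buckets[i] is None:
            # Walk right, wrapping around, to the next originally non-empty bucket.
            j = (i + 1) % bucket_count
            while buckets[j] is None:
                j = (j + 1) % bucket_count
            filled[i] = buckets[j]
    return filled


def estimated_similarity(fp_a, fp_b):
    """Rough similarity estimate: the fraction of bucket positions that agree."""
    matches = sum(1 for a, b in zip(fp_a, fp_b) if a == b)
    return matches / len(fp_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog".split()
    doc_b = "the quick brown fox jumps over the lazy cat".split()
    fp_a = bucketed_minhash(doc_a, bucket_count=16)
    fp_b = bucketed_minhash(doc_b, bucket_count=16)
    print(estimated_similarity(fp_a, fp_b))
```

With a single hash function, each token is hashed once and the fingerprint still has `bucket_count` entries, which is the motivation for the bucket_count parameter of the min_hash filter.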

@Vince-Smith

I'm curious if anyone can provide guidance on how to query these min_hash fields once they've been analyzed. I'm not finding it to be intuitive.

@yaroslav-tykhonchuk

> I'm curious if anyone can provide guidance on how to query these min_hash fields once they've been analyzed. I'm not finding it to be intuitive.

Once a field has been analyzed with a min_hash analyzer, you can inspect the resulting tokens with the _termvectors API.

Request the term vectors for a document ID and specify the field whose terms you want to see. The number of terms in the field depends on the number of hash functions, the length of the text, how many shingles the text is split into, and other parameters.
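A term vectors request of that kind might be put together as in the sketch below. It only builds the HTTP method, path, and JSON body; the index name, document ID, and field name are placeholders, and the typeless `/<index>/_termvectors/<id>` path assumes Elasticsearch 7 or later.

```python
import json


def termvectors_request(index, doc_id, field):
    """Return (method, path, body) for a term vectors request on one field."""
    path = f"/{index}/_termvectors/{doc_id}"
    body = {
        "fields": [field],         # only return terms for this field
        "positions": False,        # positions and offsets are not needed
        "offsets": False,          # to see which hash tokens were produced
        "term_statistics": False,
    }
    return "GET", path, body


if __name__ == "__main__":
    # "my-index", "1" and "fingerprint" are hypothetical names.
    method, path, body = termvectors_request("my-index", "1", "fingerprint")
    print(method, path)
    print(json.dumps(body, indent=2))
```

The terms in the response are the MinHash tokens themselves, so they are not human-readable; the useful information is how many there are and whether two documents share them.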

If you then query this field with a more_like_this query, Elasticsearch uses these terms to find similar documents, which speeds up the process considerably.

Additionally, when using like with documents, either _source must be enabled, or the fields must be stored or have term_vector enabled. To speed up analysis, it can help to store term vectors at index time.
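A more_like_this query against such a field could be built as in the following sketch. The index, document ID, and field names are placeholders, and the thresholds are assumptions chosen for the example: MinHash tokens usually occur once per document, so the default `min_term_freq` of 2 would filter them all out, and `max_query_terms` is raised from its default of 25 so that more of the fingerprint is used.

```python
import json


def more_like_this_query(index, doc_id, field, max_query_terms=512):
    """Build a search body that finds documents similar to an indexed one."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                # Refer to an already indexed document instead of raw text.
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,   # each hash token typically appears once
                "min_doc_freq": 1,    # keep tokens shared with only one document
                "max_query_terms": max_query_terms,
            }
        }
    }


if __name__ == "__main__":
    # "my-index", "1" and "fingerprint" are hypothetical names.
    print(json.dumps(more_like_this_query("my-index", "1", "fingerprint"), indent=2))
```

To store term vectors at index time, as suggested above, the field mapping would carry `"term_vector": "yes"`.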


@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024
10 participants