Minhash token filter needs better documentation #20757
Comments
@rpedela I know nothing about it. Fancy sending a PR with the details?
Also struggling to use this! Any help would be appreciated.
cc @elastic/es-search-aggs
mark it
* Add documentation for min_hash filter Closes #20757
Just for the next people who might be as confused as I was, I want to leave the following hint. I wondered to see the I found the clue here: https://issues.apache.org/jira/browse/LUCENE-6968?focusedCommentId=15263867&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15263867
I'm curious if anyone can provide guidance on how to query these
When a field has been analyzed with a MinHash analyzer, you can inspect the resulting tokens with the _termvectors API. Request the term vectors for a document ID and specify the field whose terms you want to see. The number of terms in the field depends on the number of hash functions, the length of the text, the number of shingles the text is split into, and other parameters. If you then query this field with a more_like_this query, Elasticsearch uses these terms to find similar documents, which speeds up the process a lot.
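For anyone who wants to try this, here is a rough sketch of the three request bodies involved, built as plain Python dicts so they can be sent with any HTTP client. The index name `my-index`, the field name `body`, and the filter and analyzer names are made up for illustration. The sketch also assumes the typeless URL layout (`/<index>/_termvectors/<id>`) used by recent Elasticsearch versions; older versions include the mapping type in the path. The `min_hash` and `shingle` parameter values are just a starting point, not a recommendation.

```python
import json


def minhash_index_settings(shingle_size=5, bucket_count=512):
    """Index body with a shingle filter feeding a min_hash filter.

    Field, filter and analyzer names are hypothetical.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my_shingle_filter": {
                        "type": "shingle",
                        "min_shingle_size": shingle_size,
                        "max_shingle_size": shingle_size,
                        "output_unigrams": False,
                    },
                    "my_minhash_filter": {
                        "type": "min_hash",
                        "hash_count": 1,
                        "bucket_count": bucket_count,
                        "hash_set_size": 1,
                        "with_rotation": True,
                    },
                },
                "analyzer": {
                    "my_minhash_analyzer": {
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "my_shingle_filter",
                            "my_minhash_filter",
                        ],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "body": {"type": "text", "analyzer": "my_minhash_analyzer"}
            }
        },
    }


def termvectors_request(index, doc_id, field):
    """(method, path, body) for listing the hash tokens of one document."""
    body = {
        "fields": [field],
        "positions": False,
        "offsets": False,
        "term_statistics": False,
    }
    return "GET", f"/{index}/_termvectors/{doc_id}", body


def more_like_this_query(index, doc_id, field, max_query_terms=512):
    """Search body that looks for documents similar to an indexed one."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": [{"_index": index, "_id": doc_id}],
                # The defaults (2 and 5) can filter out terms that occur
                # only once, so they are lowered here.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "max_query_terms": max_query_terms,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(minhash_index_settings(), indent=2))
    print(termvectors_request("my-index", "1", "body"))
    print(json.dumps(more_like_this_query("my-index", "1", "body"), indent=2))
```

The first body goes to `PUT /my-index`, the second to the path returned by `termvectors_request`, and the third to `GET /my-index/_search`. Whether `max_query_terms` should match `bucket_count` is a guess on my part; it is worth experimenting with.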
The Minhash Token Filter documentation only describes the interface for the token filter. That is fine for most token filters, but this one is more complicated.
In the Lucene issue, they discuss Jaccard and cosine similarities. Did that make it into the final patch? If so, should that be exposed as a setting?
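For context on the Jaccard part of the question, the general idea behind MinHash can be sketched in a few lines of Python. This is a textbook illustration and not the Lucene implementation: the seeded SHA-1 hashing, the word-shingle size, and all function names are my own assumptions. It only shows why the fraction of matching minimum hashes approximates the Jaccard similarity of two shingle sets.

```python
import hashlib


def shingles(text, size=2):
    """Set of word shingles (size consecutive words) in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(tokens, hash_count=128):
    """For each of hash_count hash functions, keep the smallest hash value."""
    if not tokens:
        raise ValueError("cannot build a signature from an empty token set")
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(hash_count)]


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox leaps over the lazy dog")
    print("exact:    ", jaccard(a, b))
    print("estimated:", estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

More hash functions give a tighter estimate at the cost of more tokens per document, which is presumably the trade-off behind the filter's `hash_count` and `bucket_count` settings.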