-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add documentation for min_hash filter #39671
Add documentation for min_hash filter #39671
Conversation
Pinging @elastic/es-search |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great addition to the docs, I left some very minor comments.
One general thought I was having: I understand why it makes sense to start with a sort of "overview" and theory, but since or docs also work as a kind of reference guide, maybe we should aim for a very brief summary (like the existing one, maybe extended slightly) followed by the table of parameters, then add the more detailed "theory" and usage sections afterwards.
Also I was wondering if it would make sense to add a small example of how to actually use any such min-hashed field in a query, e.g. for near duplicate detecton etc... or if this would go beyond the scope of our documentation.
docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc
Outdated
Show resolved
Hide resolved
internally each shingle is hashed into to 128-bit hash, you should choose | ||
`k` small enough so that all possible | ||
different k-words shingles can be hashed to 128-bit hash with | ||
minimal collision. 5-word shingles typically work well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my own education, do we have any blogs or knowledge articles around this? Or is this advice taken from the Wikipedia article or other sources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher I took an advice on 5-word shingle
from the MinHash filter sourcecode in Lucene
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats interesting, would you mind linking to that source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher Thanks for the suggetion. I opted not to include the link to this source, as I am afraid as the sourcecode changes this link becomes invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original PR that adds min_hash
, it looks like we were not sure about the 5 word suggestion, and instead encouraged 2 word shingles: #20206 (comment). It would be nice if there was a reference or set of experiments to help confirm a good default value... I didn't manage to find one in a quick search, but will keep a lookout. The right choice seems like it would depend on the use case as well (for example similarity search vs. duplicate detection).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtibshirani Thanks a lot for the review. I think the best for now is to remove this line completely "5-word shingles typically work well.", as there are conflicting suggestions what shingle size works best. Once we have better sources (external or from our own experiments), we can add shingle size suggestions to the file. Is this fine with you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a good plan to me!
docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc
Outdated
Show resolved
Hide resolved
@cbuescher Thanks a lot for the review. I have addressed your comments in the 2nd commit.
I indeed very much wanted to add this example, but I opted not to do this. The reason for this that I am not sure how to set the best query for this. A general idea is to partition resulting hashed tokens into bands; tokens in a single band should be joined by |
3dc2671
to
a273050
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayya-sharipova Thanks for adressing my previous comments, I left one more suggestion but nothing that requires another review. Feel free to adress or not.
When he replies, we can update the documentation with this query information as well
Fine by me, this PR already is a great addition. Maybe an extended example would also be better suited for a blog post or something like it. I'd be really interested in real-life usages of this.
* 6.7: Fix CCR HLRC docs Introduce forget follower API (elastic#39718) 6.6.2 release notes. Update release notes for 6.7.0 Add documentation for min_hash filter (elastic#39671) Unmute testIndividualActionsTimeout Unmute testFollowIndexAndCloseNode Use unwrapped cause to determine if node is closing (elastic#39723) Don’t ack if unable to remove failing replica (elastic#39584) Wipe Snapshots Before Indices in RestTests (elastic#39662) (elastic#39765) Bug fix for AnnotatedTextHighlighter (elastic#39525) Fix Snapshot BwC with Version 5.6.x (elastic#39737) Fix occasional SearchServiceTests failure (elastic#39697) Correct date in daterange-aggregation.asciidoc (elastic#39727) Add a note to purge the ingest-geoip plugin (elastic#39553)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just read over the documentation to learn more about this token filter and had a couple thoughts. I found these additions very helpful!
will provide a higher guarantee that different tokens are | ||
indexed to different buckets. | ||
** to improve the recall, | ||
you should increase `hash_token` parameter. For example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be hash_count
?
internally each shingle is hashed into to 128-bit hash, you should choose | ||
`k` small enough so that all possible | ||
different k-words shingles can be hashed to 128-bit hash with | ||
minimal collision. 5-word shingles typically work well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the original PR that adds min_hash
, it looks like we were not sure about the 5 word suggestion, and instead encouraged 2 word shingles: #20206 (comment). It would be nice if there was a reference or set of experiments to help confirm a good default value... I didn't manage to find one in a quick search, but will keep a lookout. The right choice seems like it would depend on the use case as well (for example similarity search vs. duplicate detection).
The documentation does not show or explain how to query for For others that are trying to figure out how to query for I think it would be a good idea to add such an example to the documentation. |
Closes #20757