Add documentation for min_hash filter #39671

mayya-sharipova · 2019-03-05T00:22:04Z

Closes #20757

elasticmachine · 2019-03-05T00:22:54Z

Pinging @elastic/es-search

cbuescher

Great addition to the docs, I left some very minor comments.
One general thought I was having: I understand why it makes sense to start with a sort of "overview" and theory, but since our docs also work as a kind of reference guide, maybe we should aim for a very brief summary (like the existing one, maybe extended slightly) followed by the table of parameters, then add the more detailed "theory" and usage sections afterwards.
Also, I was wondering if it would make sense to add a small example of how to actually use such a min-hashed field in a query, e.g. for near-duplicate detection, or if this would go beyond the scope of our documentation.

cbuescher · 2019-03-05T14:09:14Z

docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc

+internally each shingle is hashed into to 128-bit hash, you should choose
+`k` small enough so that all possible
+different k-words shingles can be hashed to 128-bit hash with
+minimal collision. 5-word shingles typically work well.


Just for my own education, do we have any blogs or knowledge articles around this? Or is this advice taken from the Wikipedia article or other sources?

@cbuescher I took the 5-word shingle advice from the MinHash filter source code in Lucene.

That's interesting, would you mind linking to that source?

@cbuescher Thanks for the suggestion. I opted not to include a link to this source, as I am afraid the link will become invalid as the source code changes.

In the original PR that adds min_hash, it looks like we were not sure about the 5 word suggestion, and instead encouraged 2 word shingles: #20206 (comment). It would be nice if there was a reference or set of experiments to help confirm a good default value... I didn't manage to find one in a quick search, but will keep a lookout. The right choice seems like it would depend on the use case as well (for example similarity search vs. duplicate detection).

@jtibshirani Thanks a lot for the review. I think the best option for now is to remove the line "5-word shingles typically work well." completely, as there are conflicting suggestions about which shingle size works best. Once we have better sources (external or from our own experiments), we can add shingle size suggestions to the file. Is this fine with you?

This sounds like a good plan to me!
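
For readers following this thread, the role of the shingle size `k` can be illustrated with a minimal Python sketch. It is not the Lucene implementation: `blake2b` with a 16-byte digest merely stands in for a 128-bit hash, and the function names (`shingles`, `hash128`, `signature`, `estimated_jaccard`) are hypothetical. It only shows how k-word shingles are hashed and how the minimum hash per hash function yields a signature that can be compared across documents.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int) -> Set[str]:
    """Return the set of k-word shingles of a whitespace-tokenized text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash128(shingle: str, seed: int) -> int:
    """Hash a shingle to a 128-bit integer.

    Assumption: blake2b is a stand-in here, not the hash Lucene uses.
    The seed is passed as a salt to simulate independent hash functions.
    """
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=16,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def signature(text: str, k: int, num_hashes: int) -> List[int]:
    """Keep the minimum hash over all shingles for each hash function."""
    shingle_set = shingles(text, k)
    if not shingle_set:
        raise ValueError("text has fewer than k words")
    return [min(hash128(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

With a larger `k`, two texts share fewer shingles unless they are near-identical, which is one reason the best value plausibly depends on the use case, as noted above.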

mayya-sharipova · 2019-03-05T19:23:08Z

@cbuescher Thanks a lot for the review. I have addressed your comments in the 2nd commit.

Also, I was wondering if it would make sense to add a small example of how to actually use such a min-hashed field in a query, e.g. for near-duplicate detection, or if this would go beyond the scope of our documentation.

I very much wanted to add such an example, but I opted not to. The reason is that I am not sure how best to construct the query. The general idea is to partition the resulting hashed tokens into bands; tokens in a single band should be joined by AND, and bands should be joined with each other by OR. I have asked the author of the MinHash filter how he thinks this query should be constructed. When he replies, we can update the documentation with this query information as well.
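
The banding idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a recommended or confirmed query: the helper name `banded_min_hash_query`, the field name, and the token strings are hypothetical, and the min-hashed tokens of the query document are assumed to have been obtained separately. The sketch uses the standard `bool` query, with `must` for AND inside a band and `should` for OR across bands.

```python
from typing import Dict, List


def banded_min_hash_query(field: str, tokens: List[str], band_size: int) -> Dict:
    """Build a bool query: AND within each band, OR across bands.

    `tokens` are the min-hashed tokens of the query document (assumed to be
    computed elsewhere); `band_size` is the number of tokens per band.
    """
    if band_size < 1:
        raise ValueError("band_size must be at least 1")
    # Partition the tokens into consecutive bands of `band_size` tokens.
    bands = [tokens[i:i + band_size] for i in range(0, len(tokens), band_size)]
    # Each band becomes a nested bool query whose term clauses must all match.
    should = [
        {"bool": {"must": [{"term": {field: token}} for token in band]}}
        for band in bands
    ]
    # A document matches if at least one whole band matches.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


# Hypothetical field name and placeholder token values.
query_body = banded_min_hash_query("body.min_hash", ["t1", "t2", "t3", "t4"], 2)
```

Smaller bands make matches easier (higher recall), while larger bands require more hashes to agree (higher precision); suitable values would need to be determined experimentally.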

cbuescher

@mayya-sharipova Thanks for addressing my previous comments. I left one more suggestion, but nothing that requires another review. Feel free to address it or not.

When he replies, we can update the documentation with this query information as well

Fine by me, this PR already is a great addition. Maybe an extended example would also be better suited for a blog post or something like it. I'd be really interested in real-life usages of this.

Closes #20757

* 6.7:
  Fix CCR HLRC docs
  Introduce forget follower API (elastic#39718)
  6.6.2 release notes.
  Update release notes for 6.7.0
  Add documentation for min_hash filter (elastic#39671)
  Unmute testIndividualActionsTimeout
  Unmute testFollowIndexAndCloseNode
  Use unwrapped cause to determine if node is closing (elastic#39723)
  Don’t ack if unable to remove failing replica (elastic#39584)
  Wipe Snapshots Before Indices in RestTests (elastic#39662) (elastic#39765)
  Bug fix for AnnotatedTextHighlighter (elastic#39525)
  Fix Snapshot BwC with Version 5.6.x (elastic#39737)
  Fix occasional SearchServiceTests failure (elastic#39697)
  Correct date in daterange-aggregation.asciidoc (elastic#39727)
  Add a note to purge the ingest-geoip plugin (elastic#39553)

jtibshirani

I just read over the documentation to learn more about this token filter and had a couple thoughts. I found these additions very helpful!

jtibshirani · 2019-03-07T23:05:58Z

docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc

+will provide a higher guarantee that different tokens are
+indexed to different buckets.
+** to improve the recall,
+you should increase `hash_token` parameter. For example,


Should this be hash_count?
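
For context, a rough sketch of where `hash_count` sits in a filter definition, written as a Python dict that could serve as an index-creation body. The helper name, the filter and analyzer names (`my_shingle_filter`, `my_minhash_filter`, `my_analyzer`), and the parameter values are illustrative assumptions, not recommendations.

```python
from typing import Any, Dict


def min_hash_index_settings(
    shingle_size: int,
    hash_count: int = 1,
    bucket_count: int = 512,
    hash_set_size: int = 1,
    with_rotation: bool = True,
) -> Dict[str, Any]:
    """Build index settings with a shingle filter followed by a min_hash filter.

    All names and values here are placeholders chosen for illustration.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my_shingle_filter": {
                        "type": "shingle",
                        # Fixed-size shingles: min and max are set equal.
                        "min_shingle_size": shingle_size,
                        "max_shingle_size": shingle_size,
                        "output_unigrams": False,
                    },
                    "my_minhash_filter": {
                        "type": "min_hash",
                        "hash_count": hash_count,
                        "bucket_count": bucket_count,
                        "hash_set_size": hash_set_size,
                        "with_rotation": with_rotation,
                    },
                },
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "standard",
                        # Shingles must be produced before they are min-hashed.
                        "filter": ["my_shingle_filter", "my_minhash_filter"],
                    }
                },
            }
        }
    }
```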

jtibshirani · 2019-03-08T01:01:32Z

docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc

+internally each shingle is hashed into to 128-bit hash, you should choose
+`k` small enough so that all possible
+different k-words shingles can be hashed to 128-bit hash with
+minimal collision. 5-word shingles typically work well.


In the original PR that adds min_hash, it looks like we were not sure about the 5 word suggestion, and instead encouraged 2 word shingles: #20206 (comment). It would be nice if there was a reference or set of experiments to help confirm a good default value... I didn't manage to find one in a quick search, but will keep a lookout. The right choice seems like it would depend on the use case as well (for example similarity search vs. duplicate detection).

Related to #39671

rdvdijk · 2024-08-06T06:02:55Z

The documentation does not show or explain how to query for min_hash values.

For others that are trying to figure out how to query for min_hash values, and end up here (as I did), an explanation on how to do this can be found in the comments of the issue: #20757 (comment)

I think it would be a good idea to add such an example to the documentation.

Add documentation for min_hash filter

ebf1979

Closes elastic#20757

mayya-sharipova added the >docs (General docs changes), v6.7.0, v7.2.0, v8.0.0, and :Search Relevance/Analysis (How text is split into tokens) labels Mar 5, 2019

cbuescher self-assigned this Mar 5, 2019

cbuescher reviewed Mar 5, 2019

Address feedback

a273050

mayya-sharipova force-pushed the documentation_minhash_token_filter branch from 3dc2671 to a273050 Compare March 5, 2019 19:31

cbuescher approved these changes Mar 6, 2019

mayya-sharipova merged commit 5b852fa into elastic:master Mar 7, 2019

mayya-sharipova deleted the documentation_minhash_token_filter branch March 7, 2019 13:47

mayya-sharipova added a commit that referenced this pull request Mar 7, 2019

Add documentation for min_hash filter (#39671)

54d41af

Closes #20757

mayya-sharipova added a commit that referenced this pull request Mar 7, 2019

Add documentation for min_hash filter (#39671)

6ce74a9

Closes #20757

jtibshirani reviewed Mar 8, 2019

mayya-sharipova added a commit that referenced this pull request Mar 8, 2019

Correct errors in min_hash filter documentation

aad9397

Related to #39671

mayya-sharipova added a commit that referenced this pull request Mar 8, 2019

Correct errors in min_hash filter documentation

674c5b2

Related to #39671

mayya-sharipova added a commit that referenced this pull request Mar 8, 2019

Correct errors in min_hash filter documentation

671a209

Related to #39671

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
