Commit

Address feedback

mayya-sharipova committed Mar 5, 2019
1 parent ebf1979 commit a273050
Showing 1 changed file with 49 additions and 43 deletions.

docs/reference/analysis/tokenfilters/minhash-tokenfilter.asciidoc
@@ -1,40 +1,6 @@
[[analysis-minhash-tokenfilter]]
=== MinHash Token Filter

The `min_hash` token filter hashes each token of the token stream and divides
the resulting hashes into buckets, keeping the lowest-valued hashes per
bucket. It then returns these hashes as tokens.
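For intuition, here is a minimal Python sketch of that bucketing scheme. It is not Lucene's actual implementation: the MD5-based hashing, the modulo bucket routing, and the rotation fill are simplified stand-ins, and the parameter names simply mirror the filter's settings.

[source,python]
--------------------------------------------------
import hashlib

def min_hash_tokens(tokens, bucket_count=512, hash_set_size=1, with_rotation=True):
    # Hash every token and route each hash to a bucket. (The real filter
    # partitions the hash range; modulo routing is a simplification.)
    buckets = [[] for _ in range(bucket_count)]
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        buckets[h % bucket_count].append(h)
    # Keep only the `hash_set_size` smallest hashes in each bucket.
    kept = [sorted(b)[:hash_set_size] for b in buckets]
    if with_rotation:
        # Fill an empty bucket from the first non-empty bucket to its
        # circular right, so every bucket contributes a value.
        for i in range(bucket_count):
            if not kept[i]:
                j = (i + 1) % bucket_count
                while j != i and not kept[j]:
                    j = (j + 1) % bucket_count
                kept[i] = list(kept[j])
    # The kept hashes are what the filter emits as output tokens.
    return [str(h) for bucket in kept for h in bucket]

print(min_hash_tokens("the quick brown fox".split(), bucket_count=4))
--------------------------------------------------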
@@ -68,11 +34,11 @@ minimal collision. 5-word shingles typically work well.

* choosing the right settings for `hash_count`, `bucket_count` and
`hash_set_size` needs some experimentation.
** to improve precision, you should increase `bucket_count` or
`hash_set_size`. Higher values of `bucket_count` or `hash_set_size`
will provide a stronger guarantee that different tokens are
indexed to different buckets.
** to improve recall, you should increase the `hash_count` parameter.
For example, setting `hash_count=2` will cause each token to be hashed
in two different ways, thus increasing the number of potential
@@ -86,6 +52,41 @@ Thus, each document's size will be increased by around 8Kb.
that it doesn't matter how many times a document contains a certain token,
only whether it contains it or not.

==== Theory
The MinHash token filter allows you to hash documents for similarity search.
Similarity search, or nearest neighbor search, is a complex problem.
A naive solution requires an exhaustive pairwise comparison between a query
document and every document in an index. This is a prohibitive operation
if the index is large. A number of approximate nearest neighbor search
solutions have been developed to make similarity search more practical and
computationally feasible. One of these solutions involves hashing of documents.

Documents are hashed in such a way that similar documents are more likely
to produce the same hash code and are put into the same hash bucket,
while dissimilar documents are more likely to be hashed into
different hash buckets. This type of hashing is known as
locality sensitive hashing (LSH).

Depending on what constitutes the similarity between documents,
various LSH functions https://arxiv.org/abs/1408.2927[have been proposed].
For https://en.wikipedia.org/wiki/Jaccard_index[Jaccard similarity], a popular
LSH function is https://en.wikipedia.org/wiki/MinHash[MinHash].
The general idea of how MinHash produces a signature for a document
is to apply a random permutation over the whole index vocabulary (a random
numbering of the vocabulary) and record the minimum value of this permutation
for the document (the minimum number among the vocabulary words present
in the document). The permutations are run several times;
combining the minimum values from all of them constitutes the
signature of the document.
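As an illustration only (the vocabulary, documents, and seed below are invented for the example), the permutation view can be sketched in a few lines of Python:

[source,python]
--------------------------------------------------
import random

def permutation_signature(doc_tokens, vocabulary, num_permutations=4, seed=0):
    # Randomly renumber the whole vocabulary once per permutation, then
    # record the smallest number assigned to any word in the document.
    rng = random.Random(seed)
    vocab = sorted(vocabulary)
    signature = []
    for _ in range(num_permutations):
        numbering = {word: n for n, word in enumerate(rng.sample(vocab, len(vocab)))}
        signature.append(min(numbering[t] for t in set(doc_tokens)))
    return signature

vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
print(permutation_signature(["the", "quick", "brown", "fox"], vocab))
print(permutation_signature(["the", "quick", "brown", "dog"], vocab))
--------------------------------------------------

For a single permutation, the probability that two documents record the same minimum equals their Jaccard similarity, which is why the fraction of matching signature positions estimates that similarity.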

In practice, instead of random permutations, a number of hash functions
are chosen. A hash function calculates a hash code for each of a
document's tokens and chooses the minimum hash code among them.
The minimum hash codes from all hash functions are combined
to form a signature for the document.
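A hedged sketch of this practical variant (seeded SHA-1 here is just a stand-in for the filter's real hash functions):

[source,python]
--------------------------------------------------
import hashlib

def hash_signature(tokens, hash_count=4):
    # One seeded hash function per signature position; each position keeps
    # the minimum hash code observed over the document's distinct tokens.
    signature = []
    for seed in range(hash_count):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in set(tokens)
        ))
    return signature

sig_a = hash_signature("a penny saved is a penny earned".split())
sig_b = hash_signature("a penny saved is a penny got".split())
# The fraction of agreeing positions estimates the Jaccard similarity.
print(sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a))
--------------------------------------------------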


==== Example of setting MinHash Token Filter in Elasticsearch
Here is an example of setting up a `min_hash` filter:

[source,js]
@@ -95,18 +96,18 @@ POST /index1
"settings": {
"analysis": {
"filter": {
"my_shingle_filter": {
"my_shingle_filter": { <1>
"type": "shingle",
"min_shingle_size": 5,
"max_shingle_size": 5,
"output_unigrams": false
},
"my_minhash_filter": {
"type": "min_hash",
"hash_count": 1,
"bucket_count": 512,
"hash_set_size": 1,
"with_rotation": true
"hash_count": 1, <2>
"bucket_count": 512, <3>
"hash_set_size": 1, <4>
"with_rotation": true <5>
}
},
"analyzer": {
@@ -123,11 +124,16 @@ POST /index1
"mappings": {
"properties": {
"text": {
"type": "text",
"fingerprint": "text",
"analyzer": "my_analyzer"
}
}
}
}
--------------------------------------------------
// NOTCONSOLE
<1> setting a shingle filter with 5-word shingles
<2> setting the min_hash filter to hash tokens with a single hash function
<3> setting the min_hash filter to hash tokens into 512 buckets
<4> setting the min_hash filter to keep only the single smallest hash in each bucket
<5> setting the min_hash filter to fill empty buckets with the value of the first non-empty bucket to its circular right
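To inspect what the filter emits, one could run the analyzer through the `_analyze` API. The snippet below is a hypothetical smoke test, assuming the index above was created on a local cluster at `localhost:9200`:

[source,python]
--------------------------------------------------
import requests

resp = requests.get(
    "http://localhost:9200/index1/_analyze",
    json={
        "analyzer": "my_analyzer",
        "text": "the quick brown fox jumped over the lazy dog",
    },
)
for token in resp.json()["tokens"]:
    print(token["token"])  # the emitted min-hash values
--------------------------------------------------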
