adding min_hash token filter docs opensearch-project#8155

Signed-off-by: Anton Rubin <[email protected]>
AntonEliatra · Sep 12, 2024 · fa72891
1 parent 76486a4
commit fa72891

Showing 2 changed files with 136 additions and 1 deletion.
diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md
@@ -39,7 +39,7 @@ Token filter | Underlying Lucene token filter|  Description
 `length` | [LengthFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html) | Removes tokens whose lengths are shorter or longer than the length range specified by `min` and `max`. 
 `limit` | [LimitTokenCountFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilter.html) | Limits the number of output tokens. A common use case is to limit the size of document field values based on token count.
 `lowercase` | [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) | Converts tokens to lowercase. The default [LowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/LowerCaseFilter.html) is for the English language. You can set the `language` parameter to `greek` (uses [GreekLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/el/GreekLowerCaseFilter.html)), `irish` (uses [IrishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ga/IrishLowerCaseFilter.html)), or `turkish` (uses [TurkishLowerCaseFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilter.html)).
-`min_hash` | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially: <br> 1. Hashes each token in the stream. <br> 2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket. <br> 3. Outputs the smallest hash from each bucket as a token stream.
+[`min_hash`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/min-hash/) | [MinHashFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/minhash/MinHashFilter.html) | Uses the [MinHash technique](https://en.wikipedia.org/wiki/MinHash) to estimate document similarity. Performs the following operations on a token stream sequentially: <br> 1. Hashes each token in the stream. <br> 2. Assigns the hashes to buckets, keeping only the smallest hashes of each bucket. <br> 3. Outputs the smallest hash from each bucket as a token stream.
 `multiplexer` | N/A | Emits multiple tokens at the same position. Runs each token through each of the specified filter lists separately and outputs the results as separate tokens.
 `ngram` | [NGramTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) | Tokenizes the given token into n-grams of lengths between `min_gram` and `max_gram`.
 Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ar/ArabicNormalizer.html) <br> `german_normalization`: [GermanNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html) <br> `hindi_normalization`: [HindiNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hi/HindiNormalizer.html) <br> `indic_normalization`: [IndicNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/in/IndicNormalizer.html) <br> `sorani_normalization`: [SoraniNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/ckb/SoraniNormalizer.html) <br> `persian_normalization`: [PersianNormalizer](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/fa/PersianNormalizer.html) <br> `scandinavian_normalization` : [ScandinavianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilter.html) <br> `scandinavian_folding`: [ScandinavianFoldingFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilter.html) <br> `serbian_normalization`: [SerbianNormalizationFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/sr/SerbianNormalizationFilter.html) | Normalizes the characters of one of the listed languages.

diff --git a/_analyzers/token-filters/min-hash.md b/_analyzers/token-filters/min-hash.md
@@ -0,0 +1,135 @@
+---
+layout: default
+title: Min hash
+parent: Token filters
+nav_order: 270
+---
+
+# Min hash token filter
+
+The `min_hash` token filter generates hashes for tokens using the [MinHash](https://en.wikipedia.org/wiki/MinHash) approximation algorithm, which is useful for detecting similarity between documents. The filter takes a set of tokens (typically from an analyzed field) and emits the resulting hashes as tokens.
+
+## Parameters
+
+The `min_hash` token filter can be configured with the following parameters:
+
+- `hash_count`: The number of hash values to generate for each token. Increasing this value generally improves the accuracy of similarity estimation but also increases the computational cost. Default: `1`. (Integer, _Optional_)
+- `bucket_count`: The number of hash buckets to use. This affects the granularity of the hashing. A larger number of buckets provides finer granularity and reduces hash collisions but requires more memory. Default: `512`. (Integer, _Optional_)
+- `hash_set_size`: The number of hashes to retain in each bucket. This can influence the quality of the hashing. Larger set sizes may lead to better similarity detection but consume more memory. Default: `1`. (Integer, _Optional_)
+- `with_rotation`: When set to `true`, the filter populates empty buckets with the value from the first non-empty bucket found to its circular right, provided that `hash_set_size` is `1`. If `bucket_count` is greater than `1`, this setting defaults to `true`; otherwise, it defaults to `false`. (Boolean, _Optional_)
+
+## Example
+
+The following example request creates a new index named `minhash_index` and configures an analyzer with a `min_hash` filter:
+
+```json
+PUT /minhash_index
+{
+  "settings": {
+    "analysis": {
+      "filter": {
+        "minhash_filter": {
+          "type": "min_hash",
+          "hash_count": 3,
+          "bucket_count": 512,
+          "hash_set_size": 1,
+          "with_rotation": false
+        }
+      },
+      "analyzer": {
+        "minhash_analyzer": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": [
+            "minhash_filter"
+          ]
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Generated tokens
+
+Use the following request to examine the tokens generated using the analyzer:
+
+```json
+POST /minhash_index/_analyze
+{
+  "analyzer": "minhash_analyzer",
+  "text": "OpenSearch is very powerful."
+}
+```
+{% include copy-curl.html %}
+
+The response contains the generated tokens. The tokens are not human readable because they represent hashes:
+
+```json
+{
+  "tokens" : [
+    {
+      "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
+      "start_offset" : 0,
+      "end_offset" : 27,
+      "type" : "MIN_HASH",
+      "position" : 0
+    },
+    {
+      "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
+      "start_offset" : 0,
+      "end_offset" : 27,
+      "type" : "MIN_HASH",
+      "position" : 0
+    },
+    ...
+```
+
+To see how the `min_hash` token filter supports similarity detection, you can use the following Python script to compare two strings using the analyzer created previously:
+
+```python
+from opensearchpy import OpenSearch
+
+# Initialize the OpenSearch client with authentication
+host = 'https://localhost:9200'  # Update if using a different host/port
+auth = ('admin', 'admin')  # Username and password
+
+# Create the OpenSearch client with SSL verification turned off
+client = OpenSearch(
+    hosts=[host],
+    http_auth=auth,
+    use_ssl=True,
+    verify_certs=False,  # Disable SSL certificate validation
+    ssl_show_warn=False  # Suppress SSL warnings in the output
+)
+
+# Function to analyze text and return the minhash tokens
+def analyze_text(index, text):
+    response = client.indices.analyze(
+        index=index,
+        body={
+            "analyzer": "minhash_analyzer",
+            "text": text
+        }
+    )
+    return [token['token'] for token in response['tokens']]
+
+# Analyze two similar texts
+tokens_1 = analyze_text('minhash_index', 'OpenSearch is a powerful search engine.')
+tokens_2 = analyze_text('minhash_index', 'OpenSearch is a very powerful search engine.')
+
+# Calculate Jaccard similarity
+set_1 = set(tokens_1)
+set_2 = set(tokens_2)
+shared_tokens = set_1.intersection(set_2)
+jaccard_similarity = len(shared_tokens) / len(set_1.union(set_2))
+
+print(f"Jaccard Similarity: {jaccard_similarity}")
+```
+
+The output should contain the Jaccard similarity score:
+
+```
+Jaccard Similarity: 0.8571428571428571
+```
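
Note (not part of the commit): the three steps listed in the `min_hash` row of the index table (hash each token, assign hashes to buckets, keep the smallest hash per bucket) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Lucene's `MinHashFilter` implementation: the names `token_hash`, `min_hash_signature`, and `jaccard` are hypothetical, BLAKE2b stands in for the hash function Lucene uses, buckets are chosen with a modulo, and `hash_set_size` is fixed at `1` with no rotation.

```python
import hashlib


def token_hash(token: str, seed: int) -> int:
    """Return a 64-bit hash of `token`; `seed` selects one of several hash functions.

    Assumption: BLAKE2b with a per-seed salt is only a stand-in for the
    hash function used by the real filter.
    """
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def min_hash_signature(tokens, hash_count=1, bucket_count=512):
    """Keep the smallest hash per (hash function, bucket) slot.

    Simplification: the bucket is chosen with a modulo, one hash is kept
    per bucket (hash_set_size = 1), and empty buckets stay empty.
    """
    smallest = {}
    for token in set(tokens):
        for seed in range(hash_count):
            value = token_hash(token, seed)
            slot = (seed, value % bucket_count)
            if slot not in smallest or value < smallest[slot]:
                smallest[slot] = value
    return set(smallest.items())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    doc_1 = "OpenSearch is a powerful search engine".split()
    doc_2 = "OpenSearch is a very powerful search engine".split()
    sig_1 = min_hash_signature(doc_1, hash_count=3, bucket_count=512)
    sig_2 = min_hash_signature(doc_2, hash_count=3, bucket_count=512)
    print(f"Estimated Jaccard similarity: {jaccard(sig_1, sig_2):.3f}")
```

The two sample sentences share 6 of their 7 distinct tokens, so their exact Jaccard similarity is 6/7 ≈ 0.857. The sketch's estimate stays close to that value as long as few tokens fall into the same bucket; a smaller `bucket_count` makes the signature more compact but coarser.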