Refactor k-NN documentation #7890

Merged · 6 commits · Aug 5, 2024
98 changes: 30 additions & 68 deletions _search-plugins/knn/approximate-knn.md
@@ -76,8 +76,9 @@ PUT my-knn-index-1
}
}
```
{% include copy-curl.html %}

In the example above, both `knn_vector` fields are configured from method definitions. Additionally, `knn_vector` fields can also be configured from models. You can learn more about this in the [knn_vector data type]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) section.
In the preceding example, both `knn_vector` fields are configured using method definitions. Additionally, `knn_vector` fields can be configured using models. For more information, see [k-NN vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/).

The `knn_vector` data type supports a vector of floats that can have a dimension count of up to 16,000 for the NMSLIB, Faiss, and Lucene engines, as set by the `dimension` mapping parameter.
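For reference, a minimal sketch of a mapping that sets the `dimension` parameter (the index name, field name, and method values are illustrative, not prescriptive):

```json
PUT my-knn-index-1
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 4,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "faiss"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}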

@@ -106,8 +107,8 @@ POST _bulk
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

```
{% include copy-curl.html %}

Then you can execute an approximate nearest neighbor search on the data using the `knn` query type:

@@ -125,6 +126,7 @@ GET my-knn-index-1/_search
}
}
```
{% include copy-curl.html %}

### The number of returned results

@@ -148,10 +150,9 @@ Starting in OpenSearch 2.14, you can use `k`, `min_score`, or `max_distance` for
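
For example, a radial search that bounds results by distance rather than by `k` might look like the following sketch (the index, field, and `max_distance` value are illustrative):

```json
GET my-knn-index-1/_search
{
  "query": {
    "knn": {
      "my_vector2": {
        "vector": [2, 3, 5, 6],
        "max_distance": 2
      }
    }
  }
}
```
{% include copy-curl.html %}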

### Building a k-NN index from a model

For some of the algorithms that we support, the native library index needs to be trained before it can be used. It would be expensive to training every newly created segment, so, instead, we introduce the concept of a *model* that is used to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-a-model), passing in the source of training data as well as the method definition of the model. Once training is complete, the model will be serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to initialize the segments.
For some of the algorithms that the k-NN plugin supports, the native library index needs to be trained before it can be used. It would be expensive to train every newly created segment, so, instead, the plugin features the concept of a *model* that initializes the native library index during segment creation. You can create a model by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-a-model) and passing in the source of the training data and the method definition of the model. Once training is complete, the model is serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to initialize the segments.

Collaborator: Above, in the sentence starting with "You can create", it appears that there is a missing word or words after "passing".

Collaborator (author): The source of the training data and the method definition are passed in, so there is no missing word.

To train a model, we first need an OpenSearch index with training data in it. Training data can come from
any `knn_vector` field that has a dimension matching the dimension of the model you want to create. Training data can be the same data that you are going to index or have in a separate set. Let's create a training index:
To train a model, you first need an OpenSearch index containing training data. Training data can come from any `knn_vector` field that has a dimension matching the dimension of the model you want to create. Training data can be the same data that you are going to index or data in a separate set. To create a training index, send the following request:

```json
PUT /train-index
@@ -170,6 +171,7 @@ PUT /train-index
}
}
```
{% include copy-curl.html %}

Notice that `index.knn` is not set in the index settings. This ensures that you do not create native library indexes for this index.

@@ -186,8 +188,9 @@ POST _bulk
{ "index": { "_index": "train-index", "_id": "4" } }
{ "train-field": [1.5, 5.5, 4.5, 6.4]}
```
{% include copy-curl.html %}

After indexing into the training index completes, we can call the Train API:
After indexing into the training index completes, you can call the Train API:
Collaborator: Can this just be "After indexing completes"?

Collaborator (author): It can, but there are two different indexes here, so I think it's better to clarify.


```json
POST /_plugins/_knn/models/my-model/_train
@@ -207,18 +210,19 @@ POST /_plugins/_knn/models/my-model/_train
}
}
```
{% include copy-curl.html %}

The Train API will return as soon as the training job is started. To check its status, we can use the Get Model API:
The Train API returns as soon as the training job is started. To check the job status, use the Get Model API:
Collaborator: This reads as though there is a missing word or words after "returns". Returns what?

Collaborator (author): The API spins off a thread and then returns, so there is no missing word.


```json
GET /_plugins/_knn/models/my-model?filter_path=state&pretty
{
"state": "training"
}
```
{% include copy-curl.html %}

Once the model enters the "created" state, you can create an index that will use this model to initialize its native
library indexes:
Once the model enters the `created` state, you can create an index that will use this model to initialize its native library indexes:

```json
PUT /target-index
@@ -238,8 +242,10 @@ PUT /target-index
}
}
```
{% include copy-curl.html %}

Lastly, you can add the documents you want to be searched to the index:

Lastly, we can add the documents we want to be searched to the index:
```json
POST _bulk
{ "index": { "_index": "target-index", "_id": "1" } }
@@ -250,8 +256,8 @@ POST _bulk
{ "target-field": [4.5, 5.5, 6.7, 3.7]}
{ "index": { "_index": "target-index", "_id": "4" } }
{ "target-field": [1.5, 5.5, 4.5, 6.4]}
...
```
{% include copy-curl.html %}

After data is ingested, it can be searched in the same way as any other `knn_vector` field.

@@ -265,7 +271,7 @@ GET my-knn-index-1/_search
"size": 2,
"query": {
"knn": {
"my_vector2": {
"target-field": {
"vector": [2, 3, 5, 6],
"k": 2,
"method_parameters" : {
@@ -294,7 +300,7 @@ Engine | Radial query support | Notes

#### `nprobes`

You can provide the `nprobes` parameter when searching an index created using the `ivf` method. The `nprobes` parameter specifies the number of `nprobes` clusters to examine in order to find the top k nearest neighbors. Higher `nprobes` values improve recall at the cost of increased search latency. The value must be positive.
You can provide the `nprobes` parameter when searching an index created using the `ivf` method. The `nprobes` parameter specifies the number of buckets to examine in order to find the top k nearest neighbors. Higher `nprobes` values improve recall at the cost of increased search latency. The value must be positive.
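
For example, a sketch of a search that passes `nprobes` through `method_parameters` (the index, field, and values are illustrative):

```json
GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "knn": {
      "target-field": {
        "vector": [2, 3, 5, 6],
        "k": 2,
        "method_parameters": {
          "nprobes": 10
        }
      }
    }
  }
}
```
{% include copy-curl.html %}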

The following table provides information about the `nprobes` parameter for the supported engines.

@@ -320,68 +326,24 @@ To learn more about using binary vectors with k-NN search, see [Binary k-NN vect

## Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). The k-NN plugin supports the following spaces.
A _space_ corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a higher score equates to a better result. The k-NN plugin supports the following spaces.

Collaborator: Above: Should the first instance of "space" be italicized?

Not every method supports each of these spaces. Be sure to check out [the method documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions) to make sure the space you are interested in is supported.
{: .note}

| Space type | Distance function ($$d$$ ) | OpenSearch score |
| :--- | :--- | :--- |
| `l1` | $$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \lvert x_i - y_i \rvert $$ | $$ score = {1 \over {1 + d} } $$ |
| `l2` | $$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n (x_i - y_i)^2 $$ | $$ score = {1 \over 1 + d } $$ |
| `linf` | $$ d(\mathbf{x}, \mathbf{y}) = max(\lvert x_i - y_i \rvert) $$ | $$ score = {1 \over 1 + d } $$ |
| `cosinesimil` | $$ d(\mathbf{x}, \mathbf{y}) = 1 - cos { \theta } = 1 - {\mathbf{x} \cdot \mathbf{y} \over \lVert \mathbf{x}\rVert \cdot \lVert \mathbf{y}\rVert}$$$$ = 1 - {\sum_{i=1}^n x_i y_i \over \sqrt{\sum_{i=1}^n x_i^2} \cdot \sqrt{\sum_{i=1}^n y_i^2}}$$, <br> where $$\lVert \mathbf{x}\rVert$$ and $$\lVert \mathbf{y}\rVert$$ represent the norms of vectors $$\mathbf{x}$$ and $$\mathbf{y}$$, respectively. | **NMSLIB** and **Faiss**:<br>$$ score = {1 \over 1 + d } $$ <br><br>**Lucene**:<br>$$ score = {2 - d \over 2}$$ |
| `innerproduct` (supported for Lucene in OpenSearch version 2.13 and later) | **NMSLIB** and **Faiss**:<br> $$ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} \cdot \mathbf{y}} = - \sum_{i=1}^n x_i y_i $$ <br><br>**Lucene**:<br> $$ d(\mathbf{x}, \mathbf{y}) = {\mathbf{x} \cdot \mathbf{y}} = \sum_{i=1}^n x_i y_i $$ | **NMSLIB** and **Faiss**:<br> $$ \text{If } d \ge 0, score = {1 \over 1 + d }$$ <br> $$\text{If } d < 0, score = -d + 1$$ <br><br>**Lucene**:<br> $$ \text{If } d > 0, score = d + 1 $$ <br> $$\text{If } d \le 0, score = {1 \over 1 + (-1 \cdot d) }$$ |
| `hamming` (supported for binary vectors in OpenSearch version 2.16 and later) | $$ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})$$ | $$ score = {1 \over 1 + d } $$ |
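
For example, with the `l2` space type, two vectors separated by a squared distance of $$d = 3$$ receive an OpenSearch score of $$1 / (1 + 3) = 0.25$$; smaller distances yield higher scores.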

<table>
<thead style="text-align: center">
<tr>
<th>Space type</th>
<th>Distance function (d)</th>
<th>OpenSearch score</th>
</tr>
</thead>
<tr>
<td>l1</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n |x_i - y_i| \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>l2</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n (x_i - y_i)^2 \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>linf</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = max(|x_i - y_i|) \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>cosinesimil</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = 1 - cos { \theta } = 1 - {\mathbf{x} &middot; \mathbf{y} \over \|\mathbf{x}\| &middot; \|\mathbf{y}\|}\]\[ = 1 -
{\sum_{i=1}^n x_i y_i \over \sqrt{\sum_{i=1}^n x_i^2} &middot; \sqrt{\sum_{i=1}^n y_i^2}}\]
where \(\|\mathbf{x}\|\) and \(\|\mathbf{y}\|\) represent the norms of vectors x and y respectively.</td>
<td><b>nmslib</b> and <b>faiss:</b>\[ score = {1 \over 1 + d } \]<br><b>Lucene:</b>\[ score = {2 - d \over 2}\]</td>
</tr>
<tr>
<td>innerproduct (supported for Lucene in OpenSearch version 2.13 and later)</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} &middot; \mathbf{y}} = - \sum_{i=1}^n x_i y_i \]
<br><b>Lucene:</b>
\[ d(\mathbf{x}, \mathbf{y}) = {\mathbf{x} &middot; \mathbf{y}} = \sum_{i=1}^n x_i y_i \]
</td>
<td>\[ \text{If} d \ge 0, \] \[score = {1 \over 1 + d }\] \[\text{If} d < 0, score = &minus;d + 1\]
<br><b>Lucene:</b>
\[ \text{If} d > 0, score = d + 1 \] \[\text{If} d \le 0\] \[score = {1 \over 1 + (-1 &middot; d) }\]
</td>
</tr>
<tr>
<td>hamming (supported for binary vectors in OpenSearch version 2.16 and later)</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>

The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equates
smaller scores with closer results, they return `1 - cosineSimilarity` for cosine similarity space---that's why `1 -` is
included in the distance function.
The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equate lower scores with closer results, they return `1 - cosineSimilarity` for the cosine similarity space---this is why `1 -` is included in the distance function.
{: .note }

With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of
such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected, and a corresponding exception will be thrown.
{: .note }

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).
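
For example, the binary vectors `00000011` and `00000001` differ in exactly one bit, so $$d = 1$$ and the resulting score is $$1 / (1 + 1) = 0.5$$.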
70 changes: 19 additions & 51 deletions _search-plugins/knn/knn-score-script.md
@@ -38,6 +38,7 @@ PUT my-knn-index-1
}
}
```
{% include copy-curl.html %}

If you *only* want to use the score script, you can omit `"index.knn": true`. The benefit of this approach is faster indexing speed and lower memory usage, but you lose the ability to perform standard k-NN queries on the index.
{: .tip}
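
For example, a minimal sketch of a score-script-only mapping (the index and field names are illustrative); note that `index.knn` is omitted:

```json
PUT my-score-script-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}
```
{% include copy-curl.html %}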
@@ -64,8 +65,8 @@ POST _bulk
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

```
{% include copy-curl.html %}

Finally, you can execute an exact nearest neighbor search on the data using the `knn` script:
```json
@@ -90,6 +91,7 @@ GET my-knn-index-1/_search
}
}
```
{% include copy-curl.html %}

All parameters are required.
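
For reference, a sketch of the complete query form (the index, field, and vector values are illustrative; `knn_score` and `"lang": "knn"` are the score script's identifiers):

```json
GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "my_vector2",
          "query_value": [2.0, 3.0, 5.0, 6.0],
          "space_type": "l2"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}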

@@ -122,6 +124,7 @@ PUT my-knn-index-2
}
}
```
{% include copy-curl.html %}

Then add some documents:

@@ -139,8 +142,8 @@ POST _bulk
{ "my_vector": [20, 20], "color" : "BLUE" }
{ "index": { "_index": "my-knn-index-2", "_id": "6" } }
{ "my_vector": [30, 30], "color" : "BLUE" }

```
{% include copy-curl.html %}

Finally, use the `script_score` query to pre-filter your documents before identifying nearest neighbors:

@@ -172,6 +175,7 @@ GET my-knn-index-2/_search
}
}
```
{% include copy-curl.html %}

## Getting started with the score script for binary data
The k-NN score script also allows you to run k-NN search on your binary data with the Hamming distance space.
@@ -195,6 +199,7 @@ PUT my-index
}
}
```
{% include copy-curl.html %}

Then add some documents:

@@ -212,8 +217,8 @@ POST _bulk
{ "my_binary": "QSBjb3VwbGUgbW9yZSBkb2NzLi4u", "color" : "BLUE" }
{ "index": { "_index": "my-index", "_id": "6" } }
{ "my_binary": "TGFzdCBvbmUh", "color" : "BLUE" }

```
{% include copy-curl.html %}

Finally, use the `script_score` query to pre-filter your documents before identifying nearest neighbors:

@@ -245,6 +250,7 @@ GET my-index/_search
}
}
```
{% include copy-curl.html %}

Similarly, you can encode your data with the `long` field and run a search:

@@ -276,58 +282,20 @@ GET my-long-index/_search
}
}
```
{% include copy-curl.html %}

## Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. The following table illustrates how OpenSearch converts spaces to scores:

<table>
<thead style="text-align: center">
<tr>
<th>spaceType</th>
<th>Distance Function (d)</th>
<th>OpenSearch Score</th>
</tr>
</thead>
<tr>
<td>l1</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n |x_i - y_i| \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>l2</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n (x_i - y_i)^2 \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>linf</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = max(|x_i - y_i|) \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>cosinesimil</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = 1 - cos { \theta } = 1 - {\mathbf{x} &middot; \mathbf{y} \over \|\mathbf{x}\| &middot; \|\mathbf{y}\|}\]\[ = 1 -
{\sum_{i=1}^n x_i y_i \over \sqrt{\sum_{i=1}^n x_i^2} &middot; \sqrt{\sum_{i=1}^n y_i^2}}\]
where \(\|\mathbf{x}\|\) and \(\|\mathbf{y}\|\) represent the norms of vectors x and y respectively.</td>
<td>\[ score = 2 - d \]</td>
</tr>
<tr>
<td>innerproduct (supported for Lucene in OpenSearch version 2.13 and later)</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} &middot; \mathbf{y}} = - \sum_{i=1}^n x_i y_i \]
</td>
<td>\[ \text{If} d \ge 0, \] \[score = {1 \over 1 + d }\] \[\text{If} d < 0, score = &minus;d + 1\]
</td>
</tr>
<tr>
<td>
hammingbit (supported for binary and long vectors) <br><br>
hamming (supported for binary vectors in OpenSearch version 2.16 and later)
</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>
A _space_ corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a higher score equates to a better result. The following table illustrates how OpenSearch converts spaces to scores.

Collaborator: Above: Should the first instance of "space" be italicized?

| Space type | Distance function ($$d$$ ) | OpenSearch score |
| :--- | :--- | :--- |
| `l1` | $$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \lvert x_i - y_i \rvert $$ | $$ score = {1 \over {1 + d} } $$ |
| `l2` | $$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n (x_i - y_i)^2 $$ | $$ score = {1 \over 1 + d } $$ |
| `linf` | $$ d(\mathbf{x}, \mathbf{y}) = max(\lvert x_i - y_i \rvert) $$ | $$ score = {1 \over 1 + d } $$ |
| `cosinesimil` | $$ d(\mathbf{x}, \mathbf{y}) = 1 - cos { \theta } = 1 - {\mathbf{x} \cdot \mathbf{y} \over \lVert \mathbf{x}\rVert \cdot \lVert \mathbf{y}\rVert}$$$$ = 1 - {\sum_{i=1}^n x_i y_i \over \sqrt{\sum_{i=1}^n x_i^2} \cdot \sqrt{\sum_{i=1}^n y_i^2}}$$, <br> where $$\lVert \mathbf{x}\rVert$$ and $$\lVert \mathbf{y}\rVert$$ represent the norms of vectors $$\mathbf{x}$$ and $$\mathbf{y}$$, respectively. | $$ score = 2 - d $$ |
| `innerproduct` (supported for Lucene in OpenSearch version 2.13 and later) | $$ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} \cdot \mathbf{y}} = - \sum_{i=1}^n x_i y_i $$ | $$ \text{If } d \ge 0, score = {1 \over 1 + d }$$ <br> $$\text{If } d < 0, score = -d + 1$$ |
| `hammingbit` (supported for binary and long vectors) <br><br>`hamming` (supported for binary vectors in OpenSearch version 2.16 and later) | $$ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})$$ | $$ score = {1 \over 1 + d } $$ |

Cosine similarity returns a number between -1 and 1, and because OpenSearch relevance scores can't be below 0, the k-NN plugin adds 1 to get the final score.
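
For example, two vectors with $$cos \theta = 0.8$$ are at distance $$d = 1 - 0.8 = 0.2$$ and receive a final score of $$2 - 0.2 = 1.8$$.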

5 changes: 2 additions & 3 deletions _search-plugins/knn/painless-functions.md
@@ -41,6 +41,7 @@ GET my-knn-index-2/_search
}
}
```
{% include copy-curl.html %}

`field` needs to map to a `knn_vector` field, and `query_value` needs to be a floating point array with the same dimension as `field`.
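
For example, a sketch of a Painless score script that uses the plugin's `l2Squared` function (the index, field, and values are illustrative):

```json
GET my-knn-index-2/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1/(1 + l2Squared(params.query_value, doc[params.field]))",
        "params": {
          "field": "my_vector",
          "query_value": [9.9, 9.9]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}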

@@ -71,7 +72,5 @@ The `hamming` space type is supported for binary vectors in OpenSearch version 2

Because scores can only be positive, this script ranks documents with vector fields higher than those without.

With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...`]) as input. This is because the magnitude of
such a vector is 0, which raises a `divide by 0` exception when computing the value. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected, and a corresponding exception will be thrown.
{: .note }