Refactor k-NN documentation #7890

Merged · 6 commits · Aug 5, 2024
98 changes: 30 additions & 68 deletions _search-plugins/knn/approximate-knn.md
@@ -76,8 +76,9 @@ PUT my-knn-index-1
}
}
```
{% include copy-curl.html %}

In the example above, both `knn_vector` fields are configured from method definitions. Additionally, `knn_vector` fields can also be configured from models. You can learn more about this in the [knn_vector data type]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) section.
In the preceding example, both `knn_vector` fields are configured using method definitions. Additionally, `knn_vector` fields can be configured using models. For more information, see [k-NN vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/).

The `knn_vector` data type supports a vector of floats that can have a dimension count of up to 16,000 for the NMSLIB, Faiss, and Lucene engines, as set by the `dimension` mapping parameter.
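For reference, a minimal sketch of a mapping that sets the `dimension` parameter (the index name, field name, and method values are illustrative, not prescriptive):

```json
PUT my-knn-index-1
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 4,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "faiss"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}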

@@ -106,8 +107,8 @@ POST _bulk
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

```
{% include copy-curl.html %}

Then you can execute an approximate nearest neighbor search on the data using the `knn` query type:

@@ -125,6 +126,7 @@ GET my-knn-index-1/_search
}
}
```
{% include copy-curl.html %}

### The number of returned results

@@ -148,10 +150,9 @@ Starting in OpenSearch 2.14, you can use `k`, `min_score`, or `max_distance` for
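
For example, a radial search that bounds results by distance rather than by `k` might look like the following sketch (the index, field, and `max_distance` value are illustrative):

```json
GET my-knn-index-1/_search
{
  "query": {
    "knn": {
      "my_vector2": {
        "vector": [2, 3, 5, 6],
        "max_distance": 2
      }
    }
  }
}
```
{% include copy-curl.html %}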

### Building a k-NN index from a model

For some of the algorithms that we support, the native library index needs to be trained before it can be used. It would be expensive to training every newly created segment, so, instead, we introduce the concept of a *model* that is used to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-a-model), passing in the source of training data as well as the method definition of the model. Once training is complete, the model will be serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to initialize the segments.
For some of the algorithms that the k-NN plugin supports, the native library index needs to be trained before it can be used. It would be expensive to train every newly created segment, so, instead, the plugin features the concept of a *model* that initializes the native library index during segment creation. You can create a model by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-a-model) and passing in the source of the training data and the method definition of the model. Once training is complete, the model is serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to initialize the segments.

Collaborator: Above, in the sentence starting with "You can create", it appears that there is a missing word or words after "passing".

Collaborator (author): The source of the training data and the method definition are passed in, so there is no missing word.

To train a model, we first need an OpenSearch index with training data in it. Training data can come from
any `knn_vector` field that has a dimension matching the dimension of the model you want to create. Training data can be the same data that you are going to index or have in a separate set. Let's create a training index:
To train a model, you first need an OpenSearch index containing training data. Training data can come from any `knn_vector` field that has a dimension matching the dimension of the model you want to create. Training data can be the same data that you are going to index or data in a separate set. To create a training index, send the following request:

```json
PUT /train-index
@@ -170,6 +171,7 @@ PUT /train-index
}
}
```
{% include copy-curl.html %}

Notice that `index.knn` is not set in the index settings. This ensures that you do not create native library indexes for this index.

@@ -186,8 +188,9 @@ POST _bulk
{ "index": { "_index": "train-index", "_id": "4" } }
{ "train-field": [1.5, 5.5, 4.5, 6.4]}
```
{% include copy-curl.html %}

After indexing into the training index completes, we can call the Train API:
After indexing into the training index completes, you can call the Train API:
Collaborator: Can this just be "After indexing completes"?

Collaborator (author): It can, but there are two different indexes here, so I think it's better to clarify.


```json
POST /_plugins/_knn/models/my-model/_train
@@ -207,18 +210,19 @@ POST /_plugins/_knn/models/my-model/_train
}
}
```
{% include copy-curl.html %}

The Train API will return as soon as the training job is started. To check its status, we can use the Get Model API:
The Train API returns as soon as the training job is started. To check the job status, use the Get Model API:
Collaborator: This reads as though there is a missing word or words after "returns". Returns what?

Collaborator (author): The API spins off a thread and then returns, so there is no missing word.


```json
GET /_plugins/_knn/models/my-model?filter_path=state&pretty
{
"state": "training"
}
```
{% include copy-curl.html %}

Once the model enters the "created" state, you can create an index that will use this model to initialize its native
library indexes:
Once the model enters the `created` state, you can create an index that will use this model to initialize its native library indexes:

```json
PUT /target-index
@@ -238,8 +242,10 @@ PUT /target-index
}
}
```
{% include copy-curl.html %}

Lastly, you can add the documents you want to be searched to the index:

Lastly, we can add the documents we want to be searched to the index:
```json
POST _bulk
{ "index": { "_index": "target-index", "_id": "1" } }
@@ -250,8 +256,8 @@ POST _bulk
{ "target-field": [4.5, 5.5, 6.7, 3.7]}
{ "index": { "_index": "target-index", "_id": "4" } }
{ "target-field": [1.5, 5.5, 4.5, 6.4]}
...
```
{% include copy-curl.html %}

After data is ingested, it can be searched in the same way as any other `knn_vector` field.

@@ -265,7 +271,7 @@ GET my-knn-index-1/_search
"size": 2,
"query": {
"knn": {
"my_vector2": {
"target-field": {
"vector": [2, 3, 5, 6],
"k": 2,
"method_parameters" : {
@@ -294,7 +300,7 @@ Engine | Radial query support | Notes

#### `nprobes`

You can provide the `nprobes` parameter when searching an index created using the `ivf` method. The `nprobes` parameter specifies the number of `nprobes` clusters to examine in order to find the top k nearest neighbors. Higher `nprobes` values improve recall at the cost of increased search latency. The value must be positive.
You can provide the `nprobes` parameter when searching an index created using the `ivf` method. The `nprobes` parameter specifies the number of buckets to examine in order to find the top k nearest neighbors. Higher `nprobes` values improve recall at the cost of increased search latency. The value must be positive.
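
For example, a sketch of a search that passes `nprobes` through `method_parameters` (the index, field, and values are illustrative):

```json
GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "knn": {
      "target-field": {
        "vector": [2, 3, 5, 6],
        "k": 2,
        "method_parameters": {
          "nprobes": 10
        }
      }
    }
  }
}
```
{% include copy-curl.html %}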

The following table provides information about the `nprobes` parameter for the supported engines.

@@ -320,68 +326,24 @@ To learn more about using binary vectors with k-NN search, see [Binary k-NN vect

## Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). The k-NN plugin supports the following spaces.
A _space_ corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a higher score equates to a better result. The k-NN plugin supports the following spaces.

Collaborator: Above: Should the first instance of "space" be italicized?

Not every method supports each of these spaces. Be sure to check out [the method documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions) to make sure the space you are interested in is supported.
{: .note}

| Space type | Distance function ($$d$$ ) | OpenSearch score |
| :--- | :--- | :--- |
| `l1` | $$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \lvert x_i - y_i \rvert $$ | $$ score = {1 \over {1 + d} } $$ |
| `l2` | $$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n (x_i - y_i)^2 $$ | $$ score = {1 \over 1 + d } $$ |
| `linf` | $$ d(\mathbf{x}, \mathbf{y}) = max(\lvert x_i - y_i \rvert) $$ | $$ score = {1 \over 1 + d } $$ |
| `cosinesimil` | $$ d(\mathbf{x}, \mathbf{y}) = 1 - cos { \theta } = 1 - {\mathbf{x} \cdot \mathbf{y} \over \lVert \mathbf{x}\rVert \cdot \lVert \mathbf{y}\rVert}$$$$ = 1 - {\sum_{i=1}^n x_i y_i \over \sqrt{\sum_{i=1}^n x_i^2} \cdot \sqrt{\sum_{i=1}^n y_i^2}}$$, <br> where $$\lVert \mathbf{x}\rVert$$ and $$\lVert \mathbf{y}\rVert$$ represent the norms of vectors $$\mathbf{x}$$ and $$\mathbf{y}$$, respectively. | **NMSLIB** and **Faiss**:<br>$$ score = {1 \over 1 + d } $$ <br><br>**Lucene**:<br>$$ score = {2 - d \over 2}$$ |
| `innerproduct` (supported for Lucene in OpenSearch version 2.13 and later) | **NMSLIB** and **Faiss**:<br> $$ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} \cdot \mathbf{y}} = - \sum_{i=1}^n x_i y_i $$ <br><br>**Lucene**:<br> $$ d(\mathbf{x}, \mathbf{y}) = {\mathbf{x} \cdot \mathbf{y}} = \sum_{i=1}^n x_i y_i $$ | **NMSLIB** and **Faiss**:<br> $$ \text{If } d \ge 0, score = {1 \over 1 + d }$$ <br> $$\text{If } d < 0, score = -d + 1$$ <br><br>**Lucene**:<br> $$ \text{If } d > 0, score = d + 1 $$ <br> $$\text{If } d \le 0, score = {1 \over 1 + (-1 \cdot d) }$$ |
| `hamming` (supported for binary vectors in OpenSearch version 2.16 and later) | $$ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})$$ | $$ score = {1 \over 1 + d } $$ |
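
For example, with the `l2` space type, two vectors separated by a squared distance of $$d = 3$$ receive an OpenSearch score of $$1 / (1 + 3) = 0.25$$; smaller distances yield higher scores.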

<table>
<thead style="text-align: center">
<tr>
<th>Space type</th>
<th>Distance function (d)</th>
<th>OpenSearch score</th>
</tr>
</thead>
<tr>
<td>l1</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n |x_i - y_i| \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>l2</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n (x_i - y_i)^2 \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>linf</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = max(|x_i - y_i|) \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>cosinesimil</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = 1 - cos { \theta } = 1 - {\mathbf{x} &middot; \mathbf{y} \over \|\mathbf{x}\| &middot; \|\mathbf{y}\|}\]\[ = 1 -
{\sum_{i=1}^n x_i y_i \over \sqrt{\sum_{i=1}^n x_i^2} &middot; \sqrt{\sum_{i=1}^n y_i^2}}\]
where \(\|\mathbf{x}\|\) and \(\|\mathbf{y}\|\) represent the norms of vectors x and y respectively.</td>
<td><b>nmslib</b> and <b>faiss:</b>\[ score = {1 \over 1 + d } \]<br><b>Lucene:</b>\[ score = {2 - d \over 2}\]</td>
</tr>
<tr>
<td>innerproduct (supported for Lucene in OpenSearch version 2.13 and later)</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} &middot; \mathbf{y}} = - \sum_{i=1}^n x_i y_i \]
<br><b>Lucene:</b>
\[ d(\mathbf{x}, \mathbf{y}) = {\mathbf{x} &middot; \mathbf{y}} = \sum_{i=1}^n x_i y_i \]
</td>
<td>\[ \text{If} d \ge 0, \] \[score = {1 \over 1 + d }\] \[\text{If} d < 0, score = &minus;d + 1\]
<br><b>Lucene:</b>
\[ \text{If} d > 0, score = d + 1 \] \[\text{If} d \le 0\] \[score = {1 \over 1 + (-1 &middot; d) }\]
</td>
</tr>
<tr>
<td>hamming (supported for binary vectors in OpenSearch version 2.16 and later)</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>

The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equates
smaller scores with closer results, they return `1 - cosineSimilarity` for cosine similarity space---that's why `1 -` is
included in the distance function.
The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equate lower scores with closer results, they return `1 - cosineSimilarity` for the cosine similarity space---this is why `1 -` is included in the distance function.
{: .note }

With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of
such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected, and a corresponding exception will be thrown.
{: .note }

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).
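
For example, the binary vectors `00000011` and `00000001` differ in exactly one bit, so $$d = 1$$ and the resulting score is $$1 / (1 + 1) = 0.5$$.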
70 changes: 19 additions & 51 deletions _search-plugins/knn/knn-score-script.md
@@ -38,6 +38,7 @@ PUT my-knn-index-1
}
}
```
{% include copy-curl.html %}

If you *only* want to use the score script, you can omit `"index.knn": true`. The benefit of this approach is faster indexing speed and lower memory usage, but you lose the ability to perform standard k-NN queries on the index.
{: .tip}
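
For example, a minimal sketch of a score-script-only mapping (the index and field names are illustrative); note that `index.knn` is omitted:

```json
PUT my-score-script-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}
```
{% include copy-curl.html %}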
@@ -64,8 +65,8 @@ POST _bulk
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

```
{% include copy-curl.html %}

Finally, you can execute an exact nearest neighbor search on the data using the `knn` script:
```json
@@ -90,6 +91,7 @@ GET my-knn-index-1/_search
}
}
```
{% include copy-curl.html %}

All parameters are required.
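
For reference, a sketch of the complete query form (the index, field, and vector values are illustrative; `knn_score` and `"lang": "knn"` are the score script's identifiers):

```json
GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "my_vector2",
          "query_value": [2.0, 3.0, 5.0, 6.0],
          "space_type": "l2"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}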

@@ -122,6 +124,7 @@ PUT my-knn-index-2
}
}
```
{% include copy-curl.html %}

Then add some documents:

@@ -139,8 +142,8 @@ POST _bulk
{ "my_vector": [20, 20], "color" : "BLUE" }
{ "index": { "_index": "my-knn-index-2", "_id": "6" } }
{ "my_vector": [30, 30], "color" : "BLUE" }

```
{% include copy-curl.html %}

Finally, use the `script_score` query to pre-filter your documents before identifying nearest neighbors:

@@ -172,6 +175,7 @@ GET my-knn-index-2/_search
}
}
```
{% include copy-curl.html %}

## Getting started with the score script for binary data
The k-NN score script also allows you to run k-NN search on your binary data with the Hamming distance space.
@@ -195,6 +199,7 @@ PUT my-index
}
}
```
{% include copy-curl.html %}

Then add some documents:

@@ -212,8 +217,8 @@ POST _bulk
{ "my_binary": "QSBjb3VwbGUgbW9yZSBkb2NzLi4u", "color" : "BLUE" }
{ "index": { "_index": "my-index", "_id": "6" } }
{ "my_binary": "TGFzdCBvbmUh", "color" : "BLUE" }

```
{% include copy-curl.html %}

Finally, use the `script_score` query to pre-filter your documents before identifying nearest neighbors:

@@ -245,6 +250,7 @@ GET my-index/_search
}
}
```
{% include copy-curl.html %}

Similarly, you can encode your data with the `long` field and run a search:

@@ -276,58 +282,20 @@ GET my-long-index/_search
}
}
```
{% include copy-curl.html %}

## Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. The following table illustrates how OpenSearch converts spaces to scores:

<table>
<thead style="text-align: center">
<tr>
<th>spaceType</th>
<th>Distance Function (d)</th>
<th>OpenSearch Score</th>
</tr>
</thead>
<tr>
<td>l1</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n |x_i - y_i| \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>l2</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n (x_i - y_i)^2 \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>linf</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = max(|x_i - y_i|) \]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>cosinesimil</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = 1 - cos { \theta } = 1 - {\mathbf{x} &middot; \mathbf{y} \over \|\mathbf{x}\| &middot; \|\mathbf{y}\|}\]\[ = 1 -
{\sum_{i=1}^n x_i y_i \over \sqrt{\sum_{i=1}^n x_i^2} &middot; \sqrt{\sum_{i=1}^n y_i^2}}\]
where \(\|\mathbf{x}\|\) and \(\|\mathbf{y}\|\) represent the norms of vectors x and y respectively.</td>
<td>\[ score = 2 - d \]</td>
</tr>
<tr>
<td>innerproduct (supported for Lucene in OpenSearch version 2.13 and later)</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} &middot; \mathbf{y}} = - \sum_{i=1}^n x_i y_i \]
</td>
<td>\[ \text{If} d \ge 0, \] \[score = {1 \over 1 + d }\] \[\text{If} d < 0, score = &minus;d + 1\]
</td>
</tr>
<tr>
<td>
hammingbit (supported for binary and long vectors) <br><br>
hamming (supported for binary vectors in OpenSearch version 2.16 and later)
</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>
A _space_ corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a higher score equates to a better result. The following table illustrates how OpenSearch converts spaces to scores.

Collaborator: Above: Should the first instance of "space" be italicized?

| Space type | Distance function ($$d$$ ) | OpenSearch score |
| :--- | :--- | :--- |
| `l1` | $$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \lvert x_i - y_i \rvert $$ | $$ score = {1 \over {1 + d} } $$ |
| `l2` | $$ d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n (x_i - y_i)^2 $$ | $$ score = {1 \over 1 + d } $$ |
| `linf` | $$ d(\mathbf{x}, \mathbf{y}) = max(\lvert x_i - y_i \rvert) $$ | $$ score = {1 \over 1 + d } $$ |
| `cosinesimil` | $$ d(\mathbf{x}, \mathbf{y}) = 1 - cos { \theta } = 1 - {\mathbf{x} \cdot \mathbf{y} \over \lVert \mathbf{x}\rVert \cdot \lVert \mathbf{y}\rVert}$$$$ = 1 - {\sum_{i=1}^n x_i y_i \over \sqrt{\sum_{i=1}^n x_i^2} \cdot \sqrt{\sum_{i=1}^n y_i^2}}$$, <br> where $$\lVert \mathbf{x}\rVert$$ and $$\lVert \mathbf{y}\rVert$$ represent the norms of vectors $$\mathbf{x}$$ and $$\mathbf{y}$$, respectively. | $$ score = 2 - d $$ |
| `innerproduct` (supported for Lucene in OpenSearch version 2.13 and later) | $$ d(\mathbf{x}, \mathbf{y}) = - {\mathbf{x} \cdot \mathbf{y}} = - \sum_{i=1}^n x_i y_i $$ | $$ \text{If } d \ge 0, score = {1 \over 1 + d }$$ <br> $$\text{If } d < 0, score = -d + 1$$ |
| `hammingbit` (supported for binary and long vectors) <br><br>`hamming` (supported for binary vectors in OpenSearch version 2.16 and later) | $$ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})$$ | $$ score = {1 \over 1 + d } $$ |

Cosine similarity returns a number between -1 and 1, and because OpenSearch relevance scores can't be below 0, the k-NN plugin adds 1 to get the final score.
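
For example, two vectors with $$cos \theta = 0.8$$ are at distance $$d = 1 - 0.8 = 0.2$$ and receive a final score of $$2 - 0.2 = 1.8$$.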

5 changes: 2 additions & 3 deletions _search-plugins/knn/painless-functions.md
@@ -41,6 +41,7 @@ GET my-knn-index-2/_search
}
}
```
{% include copy-curl.html %}

`field` needs to map to a `knn_vector` field, and `query_value` needs to be a floating point array with the same dimension as `field`.
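
For example, a sketch of a Painless score script that uses the plugin's `l2Squared` function (the index, field, and values are illustrative):

```json
GET my-knn-index-2/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1/(1 + l2Squared(params.query_value, doc[params.field]))",
        "params": {
          "field": "my_vector",
          "query_value": [9.9, 9.9]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}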

@@ -71,7 +72,5 @@ The `hamming` space type is supported for binary vectors in OpenSearch version 2

Because scores can only be positive, this script ranks documents with vector fields higher than those without.

With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...`]) as input. This is because the magnitude of
such a vector is 0, which raises a `divide by 0` exception when computing the value. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected, and a corresponding exception will be thrown.
{: .note }