Add doc for binary format support in k-NN #7840

Merged: 12 commits, Aug 1, 2024
220 changes: 218 additions & 2 deletions _field-types/supported-field-types/knn-vector.md

## Model IDs

Model IDs are used when the underlying Approximate k-NN algorithm requires a training step. As a prerequisite, the model has to be created with the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-a-model). The model contains the information needed to initialize the native library segment files.


## Binary k-NN vectors

You can reduce memory costs by a factor of 32 by switching from float to binary vectors.
Using binary vector indexes can lower operational costs while maintaining high recall performance, making large-scale deployment more economical and efficient.
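As a quick sanity check on that factor of 32: each `float` dimension occupies 32 bits, while each binary dimension occupies 1 bit. A minimal sketch (the vector count and dimension are illustrative, not requirements):

```python
# Back-of-the-envelope memory comparison for 1 million 128-dimensional vectors.
# float32 vectors store 4 bytes (32 bits) per dimension; binary vectors store 1 bit.
num_vectors = 1_000_000
dimension = 128

float_bytes = num_vectors * dimension * 4    # 32 bits per dimension
binary_bytes = num_vectors * dimension // 8  # 1 bit per dimension

print(float_bytes // binary_bytes)  # -> 32
```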

Binary format is available for the following k-NN search types:

- [Approximate k-NN]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/): Supports binary vectors only for the Faiss engine with HNSW and IVF algorithms.
- [Script score k-NN]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-score-script/): Enables the use of binary vectors in script scoring.
- [Painless extensions]({{site.url}}{{site.baseurl}}/search-plugins/knn/painless-functions/): Allows the use of binary vectors with Painless scripting extensions.

### Requirements

There are several requirements for using binary vectors in the OpenSearch k-NN plugin:

- The `data_type` of the binary vector index must be `binary`.
- The `space_type` of the binary vector index must be `hamming`.
- The `dimension` of the binary vector index must be a multiple of 8.
- You must convert your binary data into 8-bit signed integers (`int8`) in the [-128, 127] range. For example, the binary sequence of eight bits `0, 1, 1, 0, 0, 0, 1, 1` must be converted into its equivalent byte value of `99` to be used as a binary vector input.
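The conversion rule above can be sketched in a few lines. The helper `bits_to_int8` is illustrative only, not part of the plugin; it packs eight bits (most significant bit first) into a two's complement signed byte:

```python
def bits_to_int8(bits):
    """Pack 8 bits (MSB first) into a signed 8-bit integer in [-128, 127]."""
    if len(bits) != 8:
        raise ValueError("binary dimensions must come in groups of 8 bits")
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    # Values above 127 wrap around to negative numbers (two's complement).
    return value - 256 if value > 127 else value

print(bits_to_int8([0, 1, 1, 0, 0, 0, 1, 1]))  # -> 99
print(bits_to_int8([1, 0, 0, 0, 0, 0, 0, 0]))  # -> -128
```

A vector whose `dimension` is 16 would therefore be supplied as two such signed bytes.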

### Example: HNSW

To create a binary vector index with the Faiss engine and HNSW algorithm, send the following request:

```json
PUT /test-binary-hnsw
{
"settings": {
"index": {
"knn": true
}
},
"mappings": {
"properties": {
"my_vector1": {
"type": "knn_vector",
"dimension": 8,
"data_type": "binary",
"method": {
"name": "hnsw",
"space_type": "hamming",
"engine": "faiss",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
}
}
}
}
```
{% include copy-curl.html %}

Then ingest some documents containing binary vectors:

```json
PUT _bulk?refresh=true
{"index": {"_index": "test-binary-hnsw", "_id": "1"}}
{"my_vector": [7], "price": 4.4}
{"index": {"_index": "test-binary-hnsw", "_id": "2"}}
{"my_vector": [10], "price": 14.2}
{"index": {"_index": "test-binary-hnsw", "_id": "3"}}
{"my_vector": [15], "price": 19.1}
{"index": {"_index": "test-binary-hnsw", "_id": "4"}}
{"my_vector": [99], "price": 1.2}
{"index": {"_index": "test-binary-hnsw", "_id": "5"}}
{"my_vector": [80], "price": 16.5}
```
{% include copy-curl.html %}

When querying, be sure to use a binary vector:

```json
GET /test-binary-hnsw/_search
{
"size": 2,
"query": {
"knn": {
"my_vector1": {
"vector": [9],
"k": 2
}
}
}
}
```
{% include copy-curl.html %}
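OpenSearch performs this ranking internally; the following sketch only illustrates what to expect. The Hamming distance between two bytes is the popcount of their XOR, and the score is `1 / (1 + distance)`, so for the five documents above the query `[9]` should return documents 2 and 3:

```python
# Document byte values from the bulk request above, keyed by document ID.
docs = {"1": 7, "2": 10, "3": 15, "4": 99, "5": 80}
query = 9

def hamming(a, b):
    # Hamming distance = number of set bits in the XOR of the two values.
    return bin(a ^ b).count("1")

ranked = sorted(docs, key=lambda doc_id: hamming(query, docs[doc_id]))
top2 = ranked[:2]
scores = {doc_id: 1 / (1 + hamming(query, docs[doc_id])) for doc_id in top2}
print(top2)  # documents 2 and 3, each at distance 2 (score 1/3)
```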

### Example: IVF

The IVF method requires a training step that creates and trains the model that is used to initialize the native library index during segment creation. For more information, see [Building a k-NN index from a model]({{site.url}}{{site.baseurl}}/search-plugins/knn/approximate-knn/#building-a-k-nn-index-from-a-model).

First, create an index that will contain the binary vector training data. Make sure that the `dimension` matches the dimension of the model you want to create:

```json
PUT train-index
{
"mappings": {
"properties": {
"train-field": {
"type": "knn_vector",
"dimension": 8,
"data_type": "binary"
}
}
}
}
```
{% include copy-curl.html %}

Ingest the training data containing binary vectors into the training index:

```json
PUT _bulk
{ "index": { "_index": "train-index", "_id": "1" } }
{ "train-field": [1] }
{ "index": { "_index": "train-index", "_id": "2" } }
{ "train-field": [2] }
{ "index": { "_index": "train-index", "_id": "3" } }
{ "train-field": [3] }
{ "index": { "_index": "train-index", "_id": "4" } }
{ "train-field": [4] }
{ "index": { "_index": "train-index", "_id": "5" } }
{ "train-field": [5] }
```
{% include copy-curl.html %}

Then, create and train the model named `test-binary-model`. The model will train using the training data from the `train-field` in the `train-index`. Specify the `binary` data type and `hamming` space type:

```json
POST _plugins/_knn/models/test-binary-model/_train
{
"training_index": "train-index",
"training_field": "train-field",
"dimension": 8,
"description": "model with binary data",
"data_type": "binary",
"method": {
"name": "ivf",
"engine": "faiss",
"space_type": "hamming",
"parameters": {
"nlist": 1,
"nprobes": 1
}
}
}
```
{% include copy-curl.html %}

To check the model training status, call the Get Model API:

```json
GET _plugins/_knn/models/test-binary-model?filter_path=state
```
{% include copy-curl.html %}

Once the training is complete, the `state` changes to `created`.
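Because training runs asynchronously, a client typically polls until the model is ready. A sketch of such a wait loop, where `get_state` stands in for whatever call fetches the `state` field from the Get Model API response (the helper and its defaults are assumptions, not plugin APIs):

```python
import time

def wait_for_model(get_state, timeout_s=60, interval_s=1):
    """Poll a model-state callable until it returns 'created' or times out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state == "created":
            return state
        if state == "failed":
            raise RuntimeError("model training failed")
        time.sleep(interval_s)
    raise TimeoutError("model was not created within the timeout")

# Example with a stand-in state source that becomes ready on the third call:
states = iter(["training", "training", "created"])
print(wait_for_model(lambda: next(states), interval_s=0))  # -> created
```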

Next, create an index that will initialize its native library indexes using the trained model:

```json
PUT test-binary-ivf
{
"settings": {
"index": {
"knn": true
}
},
"mappings": {
"properties": {
"my_vector": {
"type": "knn_vector",
"model_id": "test-binary-model"
}
}
}
}
```
{% include copy-curl.html %}

Ingest the data containing the binary vectors that you want to search into the created index:

```json
PUT _bulk?refresh=true
{"index": {"_index": "test-binary-ivf", "_id": "1"}}
{"my_vector": [7], "price": 4.4}
{"index": {"_index": "test-binary-ivf", "_id": "2"}}
{"my_vector": [10], "price": 14.2}
{"index": {"_index": "test-binary-ivf", "_id": "3"}}
{"my_vector": [15], "price": 19.1}
{"index": {"_index": "test-binary-ivf", "_id": "4"}}
{"my_vector": [99], "price": 1.2}
{"index": {"_index": "test-binary-ivf", "_id": "5"}}
{"my_vector": [80], "price": 16.5}
```
{% include copy-curl.html %}

Finally, search the data. Be sure to provide a binary vector in the k-NN vector field:

```json
GET test-binary-ivf/_search
{
"size": 2,
"query": {
"knn": {
"my_vector1": {
"vector": [9],
"k": 2
}
}
}
}
```
{% include copy-curl.html %}
18 changes: 15 additions & 3 deletions _search-plugins/knn/approximate-knn.md

To learn more about the radial search feature, see [k-NN radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).

### Using approximate k-NN with binary vectors

To learn more about using binary vectors with k-NN search, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).

## Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). The k-NN plugin supports the following spaces.
Not every method supports each of these spaces. Be sure to check the method definitions to make sure the space you are interested in is supported.
<table>
<thead style="text-align: center">
<tr>
<th>Space type</th>
<th>Distance function (d)</th>
<th>OpenSearch score</th>
</tr>
</thead>
  <tr>
    <td>innerproduct</td>
    <td>\[ d(\mathbf{x}, \mathbf{y}) = - \mathbf{x} &middot; \mathbf{y} \]</td>
    <td>
    \[ \text{If } d > 0, score = d + 1 \] \[\text{If } d \le 0\] \[score = {1 \over 1 + (-1 &middot; d) }\]
    </td>
  </tr>
<tr>
<td>hamming (supported for binary vectors in OpenSearch version 2.16 and later)</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>

The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equate lower scores with closer results, the returned distance is `1 - cosineSimilarity`.

With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of
such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
{: .note }

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).
{: .note}
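For vectors longer than 8 bits, the distance in the table above sums bytewise. A sketch of `countSetBits(x XOR y)` over int8 arrays, with the `1 / (1 + d)` score conversion (the helper is illustrative, not a plugin API):

```python
def hamming_distance(x, y):
    """countSetBits(x XOR y) over arrays of signed 8-bit values."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    # Mask with 0xFF so negative int8 values map to their raw bit patterns.
    return sum(bin((a ^ b) & 0xFF).count("1") for a, b in zip(x, y))

# A 16-dimensional binary vector is two int8 values; score = 1 / (1 + d).
x = [99, -128]   # bits 01100011 10000000
y = [99, 0]      # bits 01100011 00000000
d = hamming_distance(x, y)
print(d, 1 / (1 + d))  # -> 1 0.5
```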
13 changes: 10 additions & 3 deletions _search-plugins/knn/knn-index.md

Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).

## Binary vector

Starting with k-NN plugin version 2.16, you can use `binary` vectors with the `faiss` engine to reduce the amount of storage space needed. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).

## SIMD optimization for the Faiss engine

Starting with version 2.13, the k-NN plugin supports [Single Instruction Multiple Data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) processing if the underlying hardware supports SIMD instructions (AVX2 on x64 architecture and Neon on ARM64 architecture). SIMD is supported by default on Linux machines only for the Faiss engine. SIMD architecture helps boost overall performance by improving indexing throughput and reducing search latency.
### Supported Faiss methods

Method name | Requires training | Supported spaces | Description
:--- | :--- | :--- | :---
`hnsw` | false | l2, innerproduct, hamming | Hierarchical proximity graph approach to approximate k-NN search.
`ivf` | true | l2, innerproduct, hamming | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.

For `hnsw`, `innerproduct` is not available when PQ is used.
{: .note}

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).
{: .note}

#### HNSW parameters

Parameter name | Required | Default | Updatable | Description
14 changes: 9 additions & 5 deletions _search-plugins/knn/knn-score-script.md
</td>
</tr>
<tr>
    <td>
      hammingbit (supported for binary and long vectors) <br><br>
hamming (supported for binary vectors in OpenSearch version 2.16 and later)
</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>

Cosine similarity returns a number between -1 and 1, and because OpenSearch relevance scores can't be below 0, the k-NN plugin adds 1 to get the final score.

With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests containing the zero vector will be rejected and a corresponding exception will be thrown.
{: .note }

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).
{: .note}
4 changes: 4 additions & 0 deletions _search-plugins/knn/painless-functions.md
Function name | Function signature | Description
:--- | :--- | :---
l2Squared | `float l2Squared (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This function calculates the L1 distance (Manhattan distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l1Norm function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:<br /> `float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)` <br />In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score.
hamming | `float hamming (float[] queryVector, doc['vector field'])` | This function calculates the Hamming distance between a given query vector and document vectors. The Hamming distance is the number of positions at which the corresponding elements are different. The shorter the distance, the more relevant the document is, so this example inverts the return value of the Hamming distance.

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).
{: .note}

## Constraints

2 changes: 1 addition & 1 deletion _search-plugins/vector-search.md

You must designate the field that will store vectors as a [`knn_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) field type. OpenSearch supports vectors of up to 16,000 dimensions, each of which is represented as a 32-bit or 16-bit float.

To save storage space, you can use `byte` or `binary` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector) and [Binary k-NN vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-k-nn-vectors).

### k-NN vector search

Expand Down
Loading