Add doc for binary format support in k-NN #7840

Merged
merged 12 commits into from
Aug 1, 2024
219 changes: 219 additions & 0 deletions _field-types/supported-field-types/knn-vector.md
@@ -267,3 +267,222 @@
return Byte(bval)
```
{% include copy.html %}

## Binary vector
Switching from float to binary vectors can reduce memory costs by a factor of 32, because each dimension is stored as a single bit instead of a 32-bit float.
Binary vector indexes can lower operational costs while maintaining high recall, making large-scale deployments more economical and efficient.
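
The factor of 32 can be checked with a little arithmetic. The following Python sketch compares the raw storage needed for the vector values alone; the corpus size and dimension are hypothetical numbers chosen for illustration, and the calculation ignores graph structures and other index overhead:

```python
def raw_vector_bytes(num_vectors: int, dimension: int, bits_per_dimension: int) -> int:
    """Bytes needed for the raw vector values only (no graph or metadata overhead)."""
    return num_vectors * dimension * bits_per_dimension // 8


num_vectors = 1_000_000  # hypothetical corpus size
dimension = 768          # hypothetical dimension (a multiple of 8)

float_bytes = raw_vector_bytes(num_vectors, dimension, 32)  # 32-bit float per dimension
binary_bytes = raw_vector_bytes(num_vectors, dimension, 1)  # 1 bit per dimension

print(f"float:  {float_bytes:,} bytes")
print(f"binary: {binary_bytes:,} bytes")
print(f"ratio:  {float_bytes // binary_bytes}x")
```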

### Supported capabilities

- **Approximate k-NN**: Binary format is currently supported only by the Faiss engine, with the HNSW and IVF algorithms.
- **Script score k-NN**: Enables the use of binary vectors in script scoring.
- **Painless extensions**: Allows the use of binary vectors with Painless scripting extensions.

### Requirements
There are several requirements for using binary vectors in the OpenSearch k-NN plugin:

#### Data type

The `data_type` of the binary vector index must be `binary`.

#### Space type

The `space_type` of the binary vector index must be `hamming`.

#### Dimension

The `dimension` of the binary vector index must be a multiple of 8.

#### Input vector

Users must pack their binary data into bytes (`int8`) before indexing. For example, the binary sequence `0, 1, 1, 0, 0, 0, 1, 1` is packed into the byte value `99`, which is then supplied as the binary format vector input.
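
The packing step can be sketched in Python as follows. This is a minimal illustration, not part of the k-NN plugin: the function name `pack_bits_to_int8` is hypothetical, the most-significant-bit-first order is inferred from the example above, and the wrap of values above 127 into the negative `int8` range is an assumption based on the `int8` type.

```python
from typing import List, Sequence


def pack_bits_to_int8(bits: Sequence[int]) -> List[int]:
    """Pack a sequence of 0/1 values into signed bytes, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("the number of bits (the dimension) must be a multiple of 8")
    if any(bit not in (0, 1) for bit in bits):
        raise ValueError("every element must be 0 or 1")

    packed = []
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            value = (value << 1) | bit  # shift in the next bit
        if value > 127:
            value -= 256  # assumption: reinterpret as a signed int8
        packed.append(value)
    return packed


print(pack_bits_to_int8([0, 1, 1, 0, 0, 0, 1, 1]))  # [99]
```

A vector with `dimension` 16 would therefore be supplied as two byte values, one per group of 8 bits.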

### Examples
The following example demonstrates how to create a binary vector index with the Faiss engine and HNSW algorithm:

```json
PUT test-binary-hnsw
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 8,
        "data_type": "binary",
        "method": {
          "name": "hnsw",
          "space_type": "hamming",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Then ingest some documents with binary vectors:

```json
PUT _bulk?refresh=true
{"index": {"_index": "test-binary-hnsw", "_id": "1"}}
{"my_vector1": [7], "price": 4.4}
{"index": {"_index": "test-binary-hnsw", "_id": "2"}}
{"my_vector1": [10], "price": 14.2}
{"index": {"_index": "test-binary-hnsw", "_id": "3"}}
{"my_vector1": [15], "price": 19.1}
{"index": {"_index": "test-binary-hnsw", "_id": "4"}}
{"my_vector1": [99], "price": 1.2}
{"index": {"_index": "test-binary-hnsw", "_id": "5"}}
{"my_vector1": [80], "price": 16.5}
```
{% include copy-curl.html %}


When querying, be sure to use a binary vector:

```json
GET test-binary-hnsw/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector1": {
        "vector": [9],
        "k": 2
      }
    }
  }
}
```
{% include copy-curl.html %}
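
To see which documents this query is expected to favor, the Hamming distances can be computed by hand. The following Python sketch performs an exact brute-force ranking of the five sample vectors against the query vector `[9]`, using the `hamming` distance and the `1 / (1 + d)` score conversion. The helper names are hypothetical, and because HNSW is an approximate method, the results returned by a real index are not guaranteed to match this exact ranking.

```python
from typing import Dict, List


def hamming_distance(x: List[int], y: List[int]) -> int:
    """Count the differing bits between two byte-packed binary vectors."""
    # "& 0xFF" keeps only the low 8 bits, so negative int8 values are handled too.
    return sum(bin((a ^ b) & 0xFF).count("1") for a, b in zip(x, y))


def knn_score(distance: int) -> float:
    """Convert a distance into a score, where a higher score is better."""
    return 1 / (1 + distance)


# The sample documents and query vector from the requests above.
docs: Dict[str, List[int]] = {"1": [7], "2": [10], "3": [15], "4": [99], "5": [80]}
query = [9]

ranked = sorted(docs, key=lambda doc_id: hamming_distance(query, docs[doc_id]))
for doc_id in ranked:
    distance = hamming_distance(query, docs[doc_id])
    print(doc_id, distance, round(knn_score(distance), 4))
```

Documents 2 and 3 each differ from the query in 2 bits, so an exact search with `k` set to 2 would return them, each with a score of 1/3.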

The following example demonstrates how to create a binary vector index with the Faiss engine and IVF algorithm.

First, create the training index with the `binary` data type:
```json
PUT train-index
{
  "mappings": {
    "properties": {
      "train-field": {
        "type": "knn_vector",
        "dimension": 8,
        "data_type": "binary"
      }
    }
  }
}
```
{% include copy-curl.html %}

Then, ingest some documents with binary vectors into the training index:
```json
PUT _bulk
{ "index": { "_index": "train-index", "_id": "1" } }
{ "train-field": [1] }
{ "index": { "_index": "train-index", "_id": "2" } }
{ "train-field": [2] }
{ "index": { "_index": "train-index", "_id": "3" } }
{ "train-field": [3] }
{ "index": { "_index": "train-index", "_id": "4" } }
{ "train-field": [4] }
{ "index": { "_index": "train-index", "_id": "5" } }
{ "train-field": [5] }
...
```
{% include copy-curl.html %}

Then, train the model with the training index and field in binary format, and specify the method space type as `hamming`:

```json
POST _plugins/_knn/models/test-binary-model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 8,
  "description": "model with binary data",
  "data_type": "binary",
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "hamming",
    "parameters": {
      "nlist": 1,
      "nprobes": 1
    }
  }
}
```
{% include copy-curl.html %}

Then, make sure the model state is `created`:
```json
GET _plugins/_knn/models/test-binary-model?filter_path=state
```
{% include copy-curl.html %}
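
Training runs asynchronously, so a client may need to repeat this request until the state changes. The following Python sketch polls for the `created` state. It is an illustration only: the function names, timeout values, and base URL are hypothetical, authentication is not handled, and the `failed` state check assumes that the plugin reports that value when training does not succeed.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_model_state(base_url: str, model_id: str) -> str:
    """Read the model state from the endpoint shown above (no authentication)."""
    url = f"{base_url}/_plugins/_knn/models/{model_id}?filter_path=state"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["state"]


def wait_for_model(get_state: Callable[[], str],
                   timeout_s: float = 60.0,
                   interval_s: float = 1.0) -> None:
    """Poll get_state() until it returns "created", fails, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == "created":
            return
        if state == "failed":  # assumption: reported when training fails
            raise RuntimeError("model training failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model still in state '{state}' after {timeout_s} seconds")
        time.sleep(interval_s)
```

For a local, unsecured cluster, a call might look like `wait_for_model(lambda: fetch_model_state("http://localhost:9200", "test-binary-model"))`, where the URL is a placeholder.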

Then, create an IVF index using the trained model:

```json
PUT test-binary-ivf
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "model_id": "test-binary-model"
      }
    }
  }
}
```
{% include copy-curl.html %}

Then ingest some documents with binary vectors:

```json
PUT _bulk?refresh=true
{"index": {"_index": "test-binary-ivf", "_id": "1"}}
{"my_vector": [7], "price": 4.4}
{"index": {"_index": "test-binary-ivf", "_id": "2"}}
{"my_vector": [10], "price": 14.2}
{"index": {"_index": "test-binary-ivf", "_id": "3"}}
{"my_vector": [15], "price": 19.1}
{"index": {"_index": "test-binary-ivf", "_id": "4"}}
{"my_vector": [99], "price": 1.2}
{"index": {"_index": "test-binary-ivf", "_id": "5"}}
{"my_vector": [80], "price": 16.5}
```
{% include copy-curl.html %}

When querying, be sure to use a binary vector:

```json
GET test-binary-ivf/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [9],
        "k": 2
      }
    }
  }
}
```
{% include copy-curl.html %}
12 changes: 12 additions & 0 deletions _search-plugins/knn/approximate-knn.md
@@ -314,6 +314,10 @@ To learn about using k-NN search with nested fields, see [k-NN search with neste

To learn more about the radial search feature, see [k-NN radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).

### Using approximate k-NN with binary vectors

To learn more about using binary vectors with k-NN search, see [k-NN search with binary vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).

## Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). The k-NN plugin supports the following spaces.
@@ -363,6 +367,11 @@ Not every method supports each of these spaces. Be sure to check out [the method
\[ \text{If} d > 0, score = d + 1 \] \[\text{If} d \le 0\] \[score = {1 \over 1 + (-1 · d) }\]
</td>
</tr>
<tr>
<td>hamming</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>

The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equate
@@ -374,3 +383,6 @@ With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as
such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
{: .note }

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
{: .note}
15 changes: 11 additions & 4 deletions _search-plugins/knn/knn-index.md
@@ -45,6 +45,10 @@

Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).

## Binary vector

Starting with k-NN plugin version 2.16, you can use `binary` vectors with the `faiss` engine to reduce the amount of storage space needed. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).

## SIMD optimization for the Faiss engine

Starting with version 2.13, the k-NN plugin supports [Single Instruction Multiple Data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) processing if the underlying hardware supports SIMD instructions (AVX2 on x64 architecture and Neon on ARM64 architecture). SIMD is supported by default on Linux machines only for the Faiss engine. SIMD architecture helps boost overall performance by improving indexing throughput and reducing search latency.
@@ -104,14 +108,17 @@

### Supported Faiss methods

Method name | Requires training | Supported spaces | Description
:--- | :--- | :--- | :---
`hnsw` | false | l2, innerproduct | Hierarchical proximity graph approach to approximate k-NN search.
`ivf` | true | l2, innerproduct | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.
Method name | Requires training | Supported spaces | Description
:--- | :--- | :--- | :---
`hnsw` | false | `l2`, `innerproduct`, `hamming` | Hierarchical proximity graph approach to approximate k-NN search.
`ivf` | true | `l2`, `innerproduct`, `hamming` | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.

For hnsw, "innerproduct" is not available when PQ is used.
{: .note}

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
{: .note}

#### HNSW parameters

Parameter name | Required | Default | Updatable | Description
10 changes: 9 additions & 1 deletion _search-plugins/knn/knn-score-script.md
@@ -323,6 +323,11 @@ A space corresponds to the function used to measure the distance between two poi
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
<tr>
<td>hamming</td>
<td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
<td>\[ score = {1 \over 1 + d } \]</td>
</tr>
</table>


@@ -331,4 +336,7 @@ Cosine similarity returns a number between -1 and 1, and because OpenSearch rele
With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as input. This is because the magnitude of
such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
containing the zero vector will be rejected and a corresponding exception will be thrown.
{: .note }
{: .note }

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
{: .note}
4 changes: 4 additions & 0 deletions _search-plugins/knn/painless-functions.md
@@ -52,6 +52,10 @@ Function name | Function signature | Description
l2Squared | `float l2Squared (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This function calculates the L1 norm distance (Manhattan distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l1Norm function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:<br /> `float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)` <br />In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score.
hamming | `float hamming (float[] queryVector, doc['vector field'])` | This function calculates the Hamming distance between a given query vector and document vectors. The Hamming distance is the number of positions at which the corresponding elements are different. The shorter the distance, the more relevant the document is, so this example inverts the return value of the Hamming distance.

The `hamming` space type is supported for binary vectors in OpenSearch version 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
{: .note}

## Constraints

2 changes: 1 addition & 1 deletion _search-plugins/vector-search.md
@@ -57,7 +57,7 @@ PUT test-index

You must designate the field that will store vectors as a [`knn_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) field type. OpenSearch supports vectors of up to 16,000 dimensions, each of which is represented as a 32-bit or 16-bit float.

To save storage space, you can use `byte` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
To save storage space, you can use `byte` or `binary` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector) and [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).

### k-NN vector search
