opensearch-project · kolchfa-aws · Aug 1, 2024 · Jul 25, 2024 · Jul 29, 2024 · Aug 1, 2024
@@ -267,3 +267,222 @@
 return Byte(bval)
 ```
 {% include copy.html %}
+
+## Binary vector
+
+Switching from float vectors to binary vectors reduces vector memory costs by a factor of 32 because each dimension is stored as 1 bit instead of a 32-bit float.
+Binary vector indexes can lower operational costs while maintaining high recall, making large-scale deployments more economical.
+
+### Supported capabilities
+
+- **Approximate k-NN**: Binary vectors are currently supported only by the Faiss engine, with the HNSW and IVF algorithms.
+- **Script score k-NN**: Binary vectors can be used in script scoring.
+- **Painless extensions**: Binary vectors can be used with the Painless scripting extensions.
+
+### Requirements
+
+Binary vectors in the OpenSearch k-NN plugin must meet the following requirements.
+
+#### Data type
+
+The `data_type` of the binary vector index must be `binary`.
+
+#### Space type
+
+The `space_type` of the binary vector index must be `hamming`.
+
+#### Dimension
+
+The `dimension` of the binary vector index must be a multiple of 8.
+
+#### Input vector
+
+Encode binary data as bytes (`int8` values) before indexing or querying. For example, pack the bit sequence `0, 1, 1, 0, 0, 0, 1, 1` into the byte value `99` and pass that value as the vector input.
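
The packing described above can be sketched in Python. This is a minimal illustration, not part of the k-NN plugin: the `pack_bits` helper name is hypothetical. It assumes that bits are ordered most significant first (consistent with the `99` example) and that values above 127 are written as negative numbers so that each one fits the signed `int8` range.

```python
def pack_bits(bits):
    """Pack 0/1 values (most significant bit first) into signed int8 values."""
    if len(bits) % 8 != 0:
        raise ValueError("the number of bits must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            if bit not in (0, 1):
                raise ValueError("each bit must be 0 or 1")
            value = (value << 1) | bit
        # Assumption: map 128..255 to -128..-1 so the value fits a signed int8.
        packed.append(value - 256 if value > 127 else value)
    return packed


print(pack_bits([0, 1, 1, 0, 0, 0, 1, 1]))  # [99]
```

A field with `dimension` set to 8 takes one packed value per vector, and a field with `dimension` set to 16 takes two.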
+
+### Examples
+
+The following example demonstrates how to create a binary vector index with the Faiss engine and HNSW algorithm:
+
+```json
+PUT test-binary-hnsw
+{
+  "settings": {
+    "index": {
+      "knn": true
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector1": {
+        "type": "knn_vector",
+        "dimension": 8,
+        "data_type": "binary",
+        "method": {
+          "name": "hnsw",
+          "space_type": "hamming",
+          "engine": "faiss",
+          "parameters": {
+            "ef_construction": 128,
+            "m": 24
+          }
+        }
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Then ingest some documents with binary vectors:
+
+```json
+PUT _bulk?refresh=true
+{"index": {"_index": "test-binary-hnsw", "_id": "1"}}
+{"my_vector1": [7], "price": 4.4}
+{"index": {"_index": "test-binary-hnsw", "_id": "2"}}
+{"my_vector1": [10], "price": 14.2}
+{"index": {"_index": "test-binary-hnsw", "_id": "3"}}
+{"my_vector1": [15], "price": 19.1}
+{"index": {"_index": "test-binary-hnsw", "_id": "4"}}
+{"my_vector1": [99], "price": 1.2}
+{"index": {"_index": "test-binary-hnsw", "_id": "5"}}
+{"my_vector1": [80], "price": 16.5}
+```
+{% include copy-curl.html %}
+
+
+When querying, be sure to use a binary vector:
+
+```json
+GET test-binary-hnsw/_search
+{
+  "size": 2,
+  "query": {
+    "knn": {
+      "my_vector1": {
+        "vector": [9],
+        "k": 2
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
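
To sanity-check what this query should return, the sample vectors can be ranked by exact Hamming distance outside OpenSearch. The following Python sketch is a brute-force illustration only. The `hamming_distance` helper is a hypothetical name, and approximate k-NN search is not guaranteed to return the same ordering, especially among ties.

```python
def hamming_distance(a, b):
    """Count the differing bits between two equal-length lists of int8 values."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of bytes")
    # Mask with 0xFF so negative int8 values are compared by their 8-bit pattern.
    return sum(bin((x ^ y) & 0xFF).count("1") for x, y in zip(a, b))


# Vectors from the bulk request above, keyed by document ID.
docs = {"1": [7], "2": [10], "3": [15], "4": [99], "5": [80]}
query = [9]

# Sort by distance, breaking ties by document ID for a stable result.
ranked = sorted(docs.items(), key=lambda item: (hamming_distance(query, item[1]), item[0]))
top_2 = [(doc_id, hamming_distance(query, vector)) for doc_id, vector in ranked[:2]]
print(top_2)  # [('2', 2), ('3', 2)]
```

Documents 2 and 3 each differ from the query vector in 2 bits, so they are the exact nearest neighbors for `k` = 2.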
+
+The following example demonstrates how to create a binary vector index with the Faiss engine and IVF algorithm.
+
+First, create a training index with the `binary` data type:
+
+```json
+PUT train-index
+{
+  "mappings": {
+    "properties": {
+      "train-field": {
+        "type": "knn_vector",
+        "dimension": 8,
+        "data_type": "binary"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Next, ingest documents containing binary vectors into the training index:
+
+```json
+PUT _bulk
+{ "index": { "_index": "train-index", "_id": "1" } }
+{ "train-field": [1] }
+{ "index": { "_index": "train-index", "_id": "2" } }
+{ "train-field": [2] }
+{ "index": { "_index": "train-index", "_id": "3" } }
+{ "train-field": [3] }
+{ "index": { "_index": "train-index", "_id": "4" } }
+{ "train-field": [4] }
+{ "index": { "_index": "train-index", "_id": "5" } }
+{ "train-field": [5] }
+...
+```
+{% include copy-curl.html %}
+
+Then train a model on the training index and field, setting `data_type` to `binary` and the method's `space_type` to `hamming`:
+
+```json
+POST _plugins/_knn/models/test-binary-model/_train
+{
+  "training_index": "train-index",
+  "training_field": "train-field",
+  "dimension": 8,
+  "description": "model with binary data",
+  "data_type": "binary",
+  "method": {
+    "name": "ivf",
+    "engine": "faiss",
+    "space_type": "hamming",
+    "parameters": {
+      "nlist": 1,
+      "nprobes": 1
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Before using the model, verify that its state is `created`:
+
+```json
+GET _plugins/_knn/models/test-binary-model?filter_path=state
+```
+{% include copy-curl.html %}
+
+Then create an IVF index that uses the trained model:
+
+```json
+PUT test-binary-ivf
+{
+  "settings": {
+    "index": {
+      "knn": true
+    }
+  },
+  "mappings": {
+    "properties": {
+      "my_vector": {
+        "type": "knn_vector",
+        "model_id": "test-binary-model"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+Then ingest some documents with binary vectors:
+
+```json
+PUT _bulk?refresh=true
+{"index": {"_index": "test-binary-ivf", "_id": "1"}}
+{"my_vector": [7], "price": 4.4}
+{"index": {"_index": "test-binary-ivf", "_id": "2"}}
+{"my_vector": [10], "price": 14.2}
+{"index": {"_index": "test-binary-ivf", "_id": "3"}}
+{"my_vector": [15], "price": 19.1}
+{"index": {"_index": "test-binary-ivf", "_id": "4"}}
+{"my_vector": [99], "price": 1.2}
+{"index": {"_index": "test-binary-ivf", "_id": "5"}}
+{"my_vector": [80], "price": 16.5}
+```
+{% include copy-curl.html %}
+
+When querying, be sure to use a binary vector:
+
+```json
+GET test-binary-ivf/_search
+{
+  "size": 2,
+  "query": {
+    "knn": {
+      "my_vector": {
+        "vector": [9],
+        "k": 2
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
@@ -314,6 +314,10 @@ To learn about using k-NN search with nested fields, see [k-NN search with neste
 
 To learn more about the radial search feature, see [k-NN radial search]({{site.url}}{{site.baseurl}}/search-plugins/knn/radial-search-knn/).
 
+### Using approximate k-NN with binary vectors
+
+To learn more about using binary vectors with k-NN search, see [k-NN search with binary vectors]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+
 ## Spaces
 
 A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). The k-NN plugin supports the following spaces. 
@@ -363,6 +367,11 @@ Not every method supports each of these spaces. Be sure to check out [the method
         \[ \text{If} d > 0, score = d + 1 \] \[\text{If} d \le 0\] \[score = {1 \over 1 + (-1 &middot; d) }\]
     </td>
   </tr>
+  <tr>
+    <td>hamming</td>
+    <td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
+    <td>\[ score = {1 \over 1 + d } \]</td>
+  </tr>
 </table>
 
 The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equates
@@ -374,3 +383,6 @@ With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...]`) as
 such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests
 containing the zero vector will be rejected and a corresponding exception will be thrown.
 {: .note }
+
+The `hamming` space type is supported only for binary vectors in OpenSearch 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+{: .note}
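
As a worked instance of the `hamming` row in the table above, the following Python sketch computes the distance as the number of set bits in the XOR of two vectors and converts it to a score using `1 / (1 + d)`. It is illustrative only: `hamming_score` is a hypothetical helper name, and the vectors are assumed to be lists of packed `int8` values.

```python
def hamming_score(x, y):
    """Return 1 / (1 + d), where d is the number of differing bits."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of bytes")
    # Mask with 0xFF so negative int8 values are compared by their 8-bit pattern.
    d = sum(bin((a ^ b) & 0xFF).count("1") for a, b in zip(x, y))
    return 1 / (1 + d)


# Two 16-dimension binary vectors, each packed into two bytes.
print(hamming_score([99, 0], [99, 0]))  # identical vectors: d = 0, score 1.0
print(hamming_score([99, 0], [98, 1]))  # two differing bits: d = 2, score 1/3
```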
@@ -45,6 +45,10 @@
 
 Starting with k-NN plugin version 2.9, you can use `byte` vectors with the `lucene` engine to reduce the amount of storage space needed. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
 
+## Binary vector
+
+Starting with k-NN plugin version 2.16, you can use `binary` vectors with the `faiss` engine to reduce the amount of storage space needed. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+
 ## SIMD optimization for the Faiss engine
 
 Starting with version 2.13, the k-NN plugin supports [Single Instruction Multiple Data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) processing if the underlying hardware supports SIMD instructions (AVX2 on x64 architecture and Neon on ARM64 architecture). SIMD is supported by default on Linux machines only for the Faiss engine. SIMD architecture helps boost overall performance by improving indexing throughput and reducing search latency.
@@ -104,14 +108,17 @@
 
 ### Supported Faiss methods
 
-Method name | Requires training | Supported spaces | Description
-:--- | :--- | :--- | :---
-`hnsw` | false | l2, innerproduct | Hierarchical proximity graph approach to approximate k-NN search.
-`ivf` | true | l2, innerproduct | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.
+Method name | Requires training | Supported spaces | Description
+:--- | :--- | :--- | :---
+`hnsw` | false | l2, innerproduct, hamming | Hierarchical proximity graph approach to approximate k-NN search.
+`ivf` | true | l2, innerproduct, hamming | Stands for _inverted file index_. Bucketing approach where vectors are assigned different buckets based on clustering and, during search, only a subset of the buckets is searched.
 
 For hnsw, "innerproduct" is not available when PQ is used.
 {: .note}
 
+The `hamming` space type is supported only for binary vectors in OpenSearch 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+{: .note}
+
 #### HNSW parameters
 
 Parameter name | Required | Default | Updatable | Description

@@ -323,6 +323,11 @@ A space corresponds to the function used to measure the distance between two poi
     <td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
     <td>\[ score = {1 \over 1 + d } \]</td>
   </tr>
+  <tr>
+    <td>hamming</td>
+    <td>\[ d(\mathbf{x}, \mathbf{y}) = \text{countSetBits}(\mathbf{x} \oplus \mathbf{y})\]</td>
+    <td>\[ score = {1 \over 1 + d } \]</td>
+  </tr>
 </table>
 
 
@@ -331,4 +336,7 @@ Cosine similarity returns a number between -1 and 1, and because OpenSearch rele
 With cosine similarity, it is not valid to pass a zero vector (`[0, 0, ...`]) as input. This is because the magnitude of
 such a vector is 0, which raises a `divide by 0` exception in the corresponding formula. Requests 
 containing the zero vector will be rejected and a corresponding exception will be thrown.
-{: .note }
+{: .note }
+
+The `hamming` space type is supported only for binary vectors in OpenSearch 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+{: .note}
@@ -52,6 +52,10 @@ Function name | Function signature | Description
 l2Squared | `float l2Squared (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
 l1Norm | `float l1Norm (float[] queryVector, doc['vector field'])` | This function calculates the square of the L2 distance (Euclidean distance) between a given query vector and document vectors. The shorter the distance, the more relevant the document is, so this example inverts the return value of the l2Squared function. If the document vector matches the query vector, the result is 0, so this example also adds 1 to the distance to avoid divide by zero errors.
 cosineSimilarity | `float cosineSimilarity (float[] queryVector, doc['vector field'])` | Cosine similarity is an inner product of the query vector and document vector normalized to both have a length of 1. If the magnitude of the query vector doesn't change throughout the query, you can pass the magnitude of the query vector to improve performance, instead of calculating the magnitude every time for every filtered document:<br /> `float cosineSimilarity (float[] queryVector, doc['vector field'], float normQueryVector)` <br />In general, the range of cosine similarity is [-1, 1]. However, in the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1 because the tf-idf statistic can't be negative. Therefore, the k-NN plugin adds 1.0 in order to always yield a positive cosine similarity score.
+hamming | `float hamming (float[] queryVector, doc['vector field'])` | This function calculates the Hamming distance between a given query vector and document vectors. The Hamming distance is the number of positions at which the corresponding elements are different. The shorter the distance, the more relevant the document is, so this example inverts the return value of the Hamming distance.
+
+The `hamming` space type is supported only for binary vectors in OpenSearch 2.16 and later. For more information, see [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
+{: .note}
 
 ## Constraints
 

@@ -57,7 +57,7 @@ PUT test-index
 
 You must designate the field that will store vectors as a [`knn_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) field type. OpenSearch supports vectors of up to 16,000 dimensions, each of which is represented as a 32-bit or 16-bit float. 
 
-To save storage space, you can use `byte` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector).
+To save storage space, you can use `byte` or `binary` vectors. For more information, see [Lucene byte vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#lucene-byte-vector) and [Binary vector]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector#binary-vector).
 
 ### k-NN vector search