Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Binary vector support #1767

Closed
heemin32 opened this issue Jun 19, 2024 · 2 comments
Closed

[RFC] Binary vector support #1767

heemin32 opened this issue Jun 19, 2024 · 2 comments
Assignees
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement v2.16.0

Comments

@heemin32
Copy link
Collaborator

heemin32 commented Jun 19, 2024

Overview

The increasing demand for binary format support from customers is becoming evident, with numerous instances demonstrating strong recall rates when using binary values generated from large language models (LLMs). For example, Cohere's introduction of the Cohere Embed embedding model, which inherently supports binary embeddings, has shown that binary vectors can retain 90-98% of the original search quality.

Given the impressive recall rates achieved with binary vectors, a growing number of users are seeking to leverage binary vectors in OpenSearch KNN indices to significantly reduce memory costs. By moving from float32 vectors to binary vectors, you can reduce the memory requirement by a factor of 32.

Implementing support for binary vectors in OpenSearch KNN indices is thus a highly beneficial feature, addressing customer demand and significantly lowering operational costs. This capability not only ensures high recall performance but also makes large-scale deployment more economically viable, facilitating greater adoption and efficiency.

Scope

  1. Support binary format with Faiss, HNSW
  2. Script scoring on binary vector using hamming distance
  3. Support binary format with Faiss, IVF
  4. Binary format support with radial search on Faiss

Out of scope

  1. Support binary format with Nmslib
  2. Binary quantization (Will be handled in a separate issue)
  3. Rescoring (Will be handled in a separate issue)

Future extension

  1. Binary format support with Lucene
  2. Support of other space type(ex. Jaccard)
  3. Exploring of java long data type to store binary vector inside OpenSearch for any performance advantage

Data flow diagram

Screenshot 2024-06-18 at 4 01 40 PM

API

Input format

User should pack their binary into byte(int8). For example, for a binary value 0, 1, 1, 0, 0, 0, 1, 1, it will be 99.

Index setting

Because we are using int8 format as input, the dimension should be a multiple of 8. We are going to support new data_type, binary. With binary data type, the hammingdistance is the only space type that we are going to support as of now. If space type is not specified, the hammingdistance will be a default value for the binary data type.

PUT test-index
{
  "settings": {// no change
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector", // no change
        "dimension": 24, // This should be multiple of 8
        "data_type": "binary",// new data type
        "method": {
          "name": "hnsw", // and also ivf
          "space_type": "hamming", // only support hamming
          "engine": "faiss"
        }
      }
    }
  }
}

Ingestion

8 bits 0, 0, 0, 0, 1, 0, 1, 0 → 1 byte 10
8 bits 1, 0, 0, 0, 1, 0, 1, 0 → 1 byte -119
8 bits 0, 1, 1, 1, 1, 0, 1, 1 → 1 byte 123

PUT test-index/_doc/1
{
   "my_vector1": [-23, 0, 123]
}

Query

Query vector will have same data format as ingestion which is binary vectors packed in byte(-128 ~ 127)

{
  "query": {
      "knn": {
        "my_vector_1": {
          "vector": [12, -12, 120],
          "k": 2
        }
    }
  }
}

Reference

Meta issue: #1764

@heemin32 heemin32 added Features Introduces a new unit of functionality that satisfies a requirement and removed untriaged labels Jun 19, 2024
@heemin32 heemin32 self-assigned this Jun 19, 2024
@jmazanec15
Copy link
Member

Overall, looks good. Interface looks good. A few comments

Might be good to point to can you reference #81.

"dimension": 24, // This should be multiple of 8

In future, can we just ignore extra bits?

"space_type": "hammingdistance", // only support hammingdistance

No, I think hamming is good here. We used hammingbit for script scoring, but the bit portion is redundant. (ref: https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/)

Will there be a lower level design coming up?

@heemin32
Copy link
Collaborator Author

heemin32 commented Jun 24, 2024

Might be good to point to can you reference #81.

Added reference to #1764 which has the link to #81

In future, can we just ignore extra bits?

There is no much difference in user experience even if we ignore extra bit because the packing in byte is done from user side. If we support an input format of an array of binary value(ex 0, 1, 1, 0) in the future, we will pad with zero for extra bit to make it a multiple of 8.

No, I think hamming is good here.

Got it. Updated the RFC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement v2.16.0
Projects
Status: Done
Development

No branches or pull requests

3 participants