[RFC] Binary vector support #1767
Comments
Overall, looks good. Interface looks good. A few comments: it might be good to reference #81.
In future, can we just ignore extra bits?
No, I think hamming is good here. We used hammingbit for script scoring, but the bit portion is redundant. (ref: https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/) Will there be a lower-level design coming up?
Added reference to #1764 which has the link to #81
There is not much difference in user experience even if we ignore extra bits, because the packing into bytes is done on the user side. If we support an input format of an array of binary values (e.g. 0, 1, 1, 0) in the future, we will pad the extra bits with zeros to make the length a multiple of 8.
Got it. Updated the RFC.
Overview
The increasing demand for binary format support from customers is becoming evident, with numerous instances demonstrating strong recall rates when using binary values generated from large language models (LLMs). For example, Cohere's introduction of the Cohere Embed embedding model, which inherently supports binary embeddings, has shown that binary vectors can retain 90-98% of the original search quality.
Given the recall rates achieved with binary vectors, a growing number of users want to use binary vectors in OpenSearch KNN indices to significantly reduce memory costs. Moving from float32 vectors to binary vectors reduces the memory requirement by a factor of 32, because each dimension takes 1 bit instead of 32 bits.
Implementing support for binary vectors in OpenSearch KNN indices is thus a highly beneficial feature, addressing customer demand and significantly lowering operational costs. This capability not only ensures high recall performance but also makes large-scale deployment more economically viable, facilitating greater adoption and efficiency.
Scope
Out of scope
Future extension
Data flow diagram
API
Input format
Users should pack their binary values into bytes (int8), eight bits per byte with the first bit as the most significant. For example, the binary value 0, 1, 1, 0, 0, 0, 1, 1 is packed as 99.
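The packing step could be done on the client side roughly as follows. This is a minimal sketch, not part of the proposed plugin code: the helper name `pack_bits_to_int8` is hypothetical, and it assumes most-significant-bit-first order, which is what the 99 example implies.

```python
from typing import List, Sequence


def pack_bits_to_int8(bits: Sequence[int]) -> List[int]:
    """Pack 0/1 values into signed bytes, most significant bit first.

    Hypothetical client-side helper; the name is an assumption for illustration.
    """
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        value = 0
        for bit in bits[start:start + 8]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            value = (value << 1) | bit
        # Reinterpret the unsigned byte (0..255) as a signed int8 (-128..127).
        packed.append(value - 256 if value >= 128 else value)
    return packed


print(pack_bits_to_int8([0, 1, 1, 0, 0, 0, 1, 1]))  # [99]
```

Any byte whose first bit is 1 comes out negative, because the value is interpreted as a two's-complement int8.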
Index setting
Because the input uses the int8 format, the dimension must be a multiple of 8. We are going to support a new data_type, binary. With the binary data type, Hamming distance is the only space type that we are going to support for now. If the space type is not specified, Hamming distance will be the default for the binary data type.
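To illustrate the dimension constraint, here is a small sketch of the kind of check a client could run before indexing. The function names `packed_length` and `check_packed_vector` are hypothetical, and the sketch assumes the dimension counts bits, so a vector of dimension d is sent as d / 8 int8 values.

```python
from typing import Sequence


def packed_length(dimension: int) -> int:
    """Number of int8 values needed for a binary vector of `dimension` bits."""
    if dimension <= 0 or dimension % 8 != 0:
        raise ValueError("dimension must be a positive multiple of 8")
    return dimension // 8


def check_packed_vector(vector: Sequence[int], dimension: int) -> None:
    """Raise ValueError if `vector` is not a valid packed binary vector."""
    expected = packed_length(dimension)
    if len(vector) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(vector)}")
    for value in vector:
        # Each element must fit in a signed byte.
        if not -128 <= value <= 127:
            raise ValueError(f"value {value} is outside the int8 range")


check_packed_vector([99, 10], dimension=16)  # passes silently
```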
Ingestion
8 bits 0, 0, 0, 0, 1, 0, 1, 0 → 1 byte 10
8 bits 1, 0, 0, 0, 1, 0, 1, 0 → 1 byte -118
8 bits 0, 1, 1, 1, 1, 0, 1, 1 → 1 byte 123
Query
The query vector uses the same data format as ingestion: binary vectors packed into bytes (-128 to 127).
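For reference, the Hamming distance between a packed query vector and a packed indexed vector can be computed directly on the bytes. The following is only an illustrative sketch with a hypothetical function name; it is not the engine's implementation.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count differing bits between two packed int8 vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    distance = 0
    for x, y in zip(a, b):
        # Mask to 8 bits so negative int8 values are compared by their
        # two's-complement bit pattern.
        distance += bin((x ^ y) & 0xFF).count("1")
    return distance


print(hamming_distance([10], [123]))  # 4
print(hamming_distance([-1], [0]))    # 8
```

The mask matters because Python integers are unbounded; without it, XOR of a negative and a non-negative value would yield a negative number instead of an 8-bit pattern.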
Reference
Meta issue: #1764