[RFC] Lucene Byte Sized Vector #952
Comments
@naveentatikonda This is a cool proposal. With this interface, I am wondering a couple of things:
@jmazanec15 Thanks for your questions.
Makes sense. I think calling it byte may be too generic. What if we call it int8, like the C++ typedef? It is being treated as an 8-bit integer, not a binary value. Float can probably remain float (not float32). This is also something we can change in the future after the release.
Actually, on second thought, byte is consistent with OpenSearch, so I am okay with it: https://opensearch.org/docs/latest/field-types/supported-field-types/numeric/
The purpose of this RFC (request for comments) is to gather community feedback on a proposal to add support for byte-sized vectors in the Lucene engine.
Problem Statement
As of today, the k-NN plugin only supports vectors of type float, which takes 4 bytes per dimension. This gets expensive in terms of storage, especially for use cases that require ingestion at a large scale, since constructing, loading, saving, and searching graphs becomes more and more costly. There are use cases where customers prefer to reduce the memory footprint by trading off a minimal loss in recall.
Using the Lucene ByteVector feature, we can add support for byte-sized vectors, where each dimension of the vector is a byte integer in the range [-128, 127].
How to convert a float value to byte?
Quantization is the process of mapping continuous infinite values to a smaller set of discrete finite values. In the context of simulation and embedded computing, it is about approximating real-world values with a digital representation that introduces limits on the precision and range of a value.
We can make use of the Quantization techniques to convert float values (32 bits) to byte (8 bits) without losing much precision. There are many Quantization techniques such as Scalar Quantization, PQ (used in faiss engine), etc.
As a P0, we are not adding support for any quantization technique, because the technique that should be used depends on the customer's use case. Based on customer requests and usage, we will add support for quantization techniques later.
Proposed Solution
Initially, as we are not planning to support any quantization technique as part of our source code, the expectation is that customers provide us pre-quantized vectors as input, of type byte integer within the range [-128, 127], for both indexing and querying. For users to ingest these vectors using the KnnByteVectorField, we will be introducing a new optional field, data_type, in the index mapping. There are no changes to the query mapping.

data_type - Set this as byte to index documents as byte-sized vectors; the default value is float.

Examples of creating an index, ingesting documents, and querying byte-sized vectors are shown below:
Creating Index with data_type as byte
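A sketch of what such a mapping could look like, following the k-NN plugin's existing knn_vector mapping conventions; the index name, field name, dimension, and method parameters here are illustrative assumptions, not part of this RFC:

```json
PUT /test-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene"
        }
      }
    }
  }
}
```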
Ingest Documents
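Ingestion would look the same as for float vectors, except every dimension must be an integer in [-128, 127]; the document below is an illustrative sketch:

```json
PUT /test-index/_doc/1
{
  "my_vector": [-126, 28, 127]
}
```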
Search Query
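As noted above, there are no changes to the query mapping; the query vector simply needs to be quantized to the same byte range as the indexed vectors. An illustrative sketch:

```json
GET /test-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [26, -120, 99],
        "k": 2
      }
    }
  }
}
```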
Also, in approximate search, byte-sized vectors are supported only for the lucene engine. They are not supported for the nmslib and faiss engines.

Benchmarking on POC
Setup Configuration
Implemented a POC using the Lucene ByteVectorField and ran benchmarks against various datasets. The cluster configuration and index mapping are shown in the table below.
Quantization and Normalization
Min-Max Normalization - This technique performs a linear transformation on the original data which scales the values of a feature to a range between 0 and 1. This is done by subtracting the minimum value of the feature from each value, and then dividing by the range of the feature.
Scalar Quantization - Splits the entire space of each dimension into discrete bins in order to reduce the overall memory footprint of the vector data.
Quantization Technique A
For these benchmarking tests, all of the datasets used have float-valued vectors, so we normalized them using min-max normalization to transform and scale the values into the range 0 to 1. Then, we quantized these values to bucketize them into 256 buckets (ranging from -128 to 127).
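The two steps above can be sketched as follows. This is a minimal illustration of min-max normalization followed by bucketing into the byte range; the function names are ours, and whether the min/max is computed per dimension or over the whole dataset is an implementation choice not specified in this RFC:

```python
def min_max_normalize(values):
    """Linearly scale a list of floats into [0, 1]:
    subtract the minimum, then divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def quantize_to_byte(normalized):
    """Map values in [0, 1] onto the 256 integer buckets [-128, 127].
    The clamp handles the edge case where the value is exactly 1.0."""
    return [min(127, int(v * 256) - 128) for v in normalized]
```

For example, `quantize_to_byte(min_max_normalize([0.0, 5.0, 10.0]))` maps the minimum to -128 and the maximum to 127, with intermediate values spread across the buckets in between.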
Quantization Technique B
Euclidean distance is shift invariant which means ||x-y||=||(x-z)-(y-z)|| (If we shift both x and y by the same z then the distance remains the same). But, cosine similarity is not (cosine(x, y) does not equal cosine(x-z, y-z)).
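This property is easy to verify numerically. The small sketch below (our own illustration, not from the POC) shifts two vectors by the same offset and checks that the Euclidean distance is unchanged while the cosine similarity is not:

```python
import math

def euclidean(x, y):
    # Standard L2 distance between two vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

x, y, z = [1.0, 2.0], [3.0, 5.0], [4.0, 4.0]
xs = [a - c for a, c in zip(x, z)]  # x shifted by z
ys = [b - c for b, c in zip(y, z)]  # y shifted by z

# Euclidean distance is shift invariant...
assert abs(euclidean(x, y) - euclidean(xs, ys)) < 1e-9
# ...but cosine similarity changes under the same shift.
assert abs(cosine(x, y) - cosine(xs, ys)) > 0.01
```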
So, for the angular datasets, to avoid shifting we follow a different approach that quantizes positive and negative values separately (pseudo code shown below for the glove-200-angular dataset). There is a huge difference in recall after using this technique: it improved the recall for glove-200 from 0.17 (with QT A) to 0.77 (with QT B).
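One possible shape of such a scheme is sketched below. This is our hypothetical reconstruction of the idea, not the POC's actual pseudo code: positive and negative components are scaled separately against the dataset's positive maximum and negative minimum, so zero maps to zero and no shift is introduced:

```python
def quantize_signed(vectors):
    """Hypothetical QT B sketch: scale positive values into [0, 127]
    and negative values into [-128, 0) separately, so that zero stays
    zero and the data is never shifted."""
    pos_max = max((v for vec in vectors for v in vec if v > 0), default=1.0)
    neg_min = min((v for vec in vectors for v in vec if v < 0), default=-1.0)
    out = []
    for vec in vectors:
        q = []
        for v in vec:
            if v >= 0:
                # Positive side: map (0, pos_max] onto (0, 127].
                q.append(min(127, int(v / pos_max * 127)))
            else:
                # Negative side: map [neg_min, 0) onto [-128, 0).
                q.append(max(-128, int(v / neg_min * -128)))
        out.append(q)
    return out
```

Because positive and negative values use independent scale factors, the sign structure of each vector is preserved, which is what matters for cosine similarity.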
Benchmarking Results Comparison
Observations
Ran a test using 1 primary shard and zero replicas. After force-merging all the segments into one segment, we can see the segment files with their corresponding sizes listed below. The storage space occupied by the segment files is the same for both float vectors and byte vectors, except for the .vec file, which shows that the byte vectors (113 MB) occupy 1/4 of the size of the float vectors (452 MB), which is what we expect. But we are still not seeing the expected overall reduction, because the .fdt files, which hold nothing but the source data, consume 1.7 GB for both data types.
Feedback
Please provide your feedback; any questions about the feature are welcome.