[FEATURE] Support for Faiss byte vector #1659
Comments
@naveentatikonda what quantization technique is used?
Scalar Quantization, like SQfp16.
Right, but how do they implement the 8-bit quantization? I don't think they can quantize into fp8, because too much precision would be lost.
Yes, basically they are serializing fp32 values into uint8 (0 to 255), which loses the fractional precision when the values are deserialized back into float. This feature helps to optimize memory at the cost of recall. Also, if the vector dimension is a multiple of 16 they process 16 values in each iteration (unlike the 8 values we have seen with fp16), so I'm hoping it might boost performance and help reduce search latencies.
But how are they doing this? Would they take 0.2212 -> 0?
Yes, you are right.
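To make the precision loss concrete, here is a minimal standalone C++ sketch (an illustration of the behavior discussed above, not the actual Faiss codec code): a direct 8-bit encode is essentially a cast of each component to a byte, so the fractional part is dropped on the round trip.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustration only: a "direct" 8-bit codec casts each float component to a
// byte on encode and casts it back on decode, so 0.2212f comes back as 0.0f.
int main() {
    std::vector<float> original = {0.2212f, 1.7f, 42.9f, 127.0f};

    std::vector<uint8_t> code(original.size());
    for (size_t i = 0; i < original.size(); ++i) {
        code[i] = static_cast<uint8_t>(original[i]);   // encode: truncating cast
    }

    std::vector<float> decoded(original.size());
    for (size_t i = 0; i < original.size(); ++i) {
        decoded[i] = static_cast<float>(code[i]);      // decode: byte back to float
    }

    for (size_t i = 0; i < original.size(); ++i) {
        std::printf("%.4f -> %.1f\n", original[i], decoded[i]);
    }
    return 0;
}
```

Running this prints `0.2212 -> 0.0`, matching the example above: the memory savings come from giving up everything below integer resolution.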
From an interface perspective, I think this should then just be byte vector support for Faiss. Otherwise, it may confuse users who expect it to behave like Lucene's 8-bit scalar quantization.
**Why are we streaming vectors in batches from JNI to Faiss for byte vectors?** With recent changes, we stream vectors from Java to JNI in batches whose size is 1% of the JVM heap. But for byte vectors we still need to break this down further and stream smaller batches of 1000 vectors from JNI to Faiss to avoid a spike in memory consumption, because the ScalarQuantizer expects the input vectors as floats, so we need to cast these byte vectors into floats before ingesting them into the index.

**Why a batch size of 1000?** 1000 is not a magic number. I ran a test comparing a batch size of 1000 against a batch size of 1, and the RSS metrics (graph shown below) clearly show that force merge took longer with a batch size of 1: 12.7 min versus 8.8 min for 1000 on the Cohere-1M-768D-InnerProduct dataset. We can start with 1000 and bump it up to 10K (~30 MB of extra memory) later if we want to further reduce this latency.
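A hedged sketch of that JNI-side batching idea (the function and parameter names here are illustrative, not the actual k-NN JNI code): the bytes are widened to floats in a reusable buffer of at most `batchSize * dim` floats, so the temporary overhead stays around 3 MB for batches of 1000 768-dimensional vectors, and roughly 30 MB if the batch size were raised to 10K.

```cpp
#include <faiss/Index.h>

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: the ScalarQuantizer-backed index ingests floats, so signed-byte
// vectors are cast to float in a bounded temporary buffer and added to the
// index batch by batch instead of all at once.
void addByteVectorsInBatches(faiss::Index* index,
                             const int8_t* byteVectors,
                             size_t numVectors,
                             size_t dim,
                             size_t batchSize = 1000) {
    std::vector<float> floatBuffer(batchSize * dim);
    for (size_t start = 0; start < numVectors; start += batchSize) {
        size_t n = std::min(batchSize, numVectors - start);
        const int8_t* src = byteVectors + start * dim;
        for (size_t i = 0; i < n * dim; ++i) {
            floatBuffer[i] = static_cast<float>(src[i]);  // widen byte -> float
        }
        index->add(static_cast<int64_t>(n), floatBuffer.data());
    }
}
```

With `dim = 768`, the buffer is 1000 * 768 * 4 bytes ≈ 3 MB, which is the trade-off the comment above weighs against merge latency.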
Is your feature request related to a problem?
For the Lucene engine we have the Lucene byte vector feature, which accepts byte vectors in the range [-128, 127] and provides memory savings of up to 75% compared with fp32 vectors. But for large-scale workloads we usually prefer the Faiss engine, and as of today the Faiss engine only supports fp32 and fp16 vectors (using SQfp16). So, adding byte vector support to the Faiss engine helps reduce memory requirements, especially for users whose models, such as Cohere Embed, generate signed int8 embeddings in the range [-128, 127].
What solution would you like?
Add a new Faiss ScalarQuantizer type like QT_8bit_direct, which doesn't require training and quantizes fp32 vector values (values already within the signed byte range and without any fractional part) into byte-sized vectors, reducing the memory footprint by a factor of 4.
https://faiss.ai/cpp_api/struct/structfaiss_1_1ScalarQuantizer.html
facebookresearch/faiss#3488
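For reference, a hedged sketch of what building such an index could look like with the Faiss C++ API, assuming the existing training-free QT_8bit_direct quantizer type (the linked upstream issue tracks the exact variant needed to cover the signed [-128, 127] range):

```cpp
#include <faiss/IndexHNSW.h>
#include <faiss/IndexScalarQuantizer.h>

#include <vector>

// Sketch: an HNSW index whose storage uses the direct 8-bit scalar quantizer,
// storing each component as a single byte (1/4 of the fp32 footprint).
int main() {
    int dim = 768;    // e.g. Cohere embedding dimension
    int hnsw_m = 16;  // HNSW graph degree

    faiss::IndexHNSWSQ index(dim,
                             faiss::ScalarQuantizer::QT_8bit_direct,
                             hnsw_m,
                             faiss::METRIC_INNER_PRODUCT);

    // The direct quantizer has no learned parameters, so no explicit train()
    // step should be required here, unlike QT_8bit, which learns per-dimension
    // ranges from training data.
    std::vector<float> vecs(2 * dim, 1.0f);  // toy vectors already in byte range
    index.add(2, vecs.data());
    return 0;
}
```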