[FEATURE] Hamming distance / binary vector support #81
Comments
At the moment, we do not have a plan to add binary index support. Currently, we are working on faiss support (#27). Adding binary vector support would be a big project; we would probably need to create a new field type. That said, to the community: please +1 if you would like to see this feature in the plugin.
Got it, thanks for the feedback @jmazanec15. So would the added faiss support include support for their binary indexes (e.g. IndexBinaryHNSW)?
@jaredbrowarnik no, faiss support will not include it. This would be another project. I think, from a high level, we might need to take a few steps.
Would very much appreciate this support, since dense float vectors present a much bigger challenge when trying to scale to 10 billion documents. I have also had a difficult time figuring out whether Elasticsearch 7.15 supports a bit_hamming space for binary vectors or an equivalent (e.g. a base64-encoded string), and whether the script-score k-NN approach would even be feasible with that many documents (see https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/#getting-started-with-the-score-script-for-binary-data). Any thoughts on the above?
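For reference, the linked score-script approach is an exact (brute-force) Hamming scan over matched documents, which is why feasibility at 10B documents is the open question. A minimal sketch of such a query via opensearch-py, following the linked docs and assuming a base64-encoded `binary` field named `hash` (the index name, field name, and query value are hypothetical):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

query = {
    "size": 10,
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "hash",                      # hypothetical binary field
                    "query_value": "SGVsbG8gd29ybGQh",    # base64 of the query bits
                    "space_type": "hammingbit",
                },
            },
        }
    },
}

response = client.search(index="images", body=query)
```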
+1
+1
+1
+1
+1
As the faiss support has been implemented, can we use approximate Hamming distance on binary types yet? Faiss has binary index types (https://github.com/facebookresearch/faiss/wiki/Binary-indexes), so I don't think this needs to be a separate project; shouldn't we just be able to pass hamming as the space argument when performing a k-NN query?
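For illustration, a small sketch of the faiss binary index mentioned above (IndexBinaryHNSW), where the dimension is given in bits and vectors are supplied as packed uint8 bytes; the data here is random and purely illustrative:

```python
import faiss
import numpy as np

d = 256                                  # dimension in bits
rng = np.random.default_rng(0)
xb = rng.integers(0, 256, size=(10_000, d // 8), dtype=np.uint8)  # database
xq = rng.integers(0, 256, size=(5, d // 8), dtype=np.uint8)       # queries

index = faiss.IndexBinaryHNSW(d, 16)     # HNSW graph over binary vectors, M=16
index.add(xb)

distances, ids = index.search(xq, k=10)  # distances are Hamming distances
```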
Hi @abbottdev, yes, I think faiss and nmslib support binary indices; we could leverage them. Could you describe your use case a little more: what problem space are you using this for? In #949 you mentioned you had 500M+ 64-bit binary vectors. How did you decide to use binary over float representations?
@jmazanec15 - My use case may be a little different from the OP's, but for me the case is that we want a binary vector database in order to index PDQ image hashes. Because this is a perceptual-hash problem, hashes that are "closer" are more relevant, and for the algorithm in question closeness is the Hamming distance between two hashes, each a 256-bit binary vector. So this isn't strictly a standard ML/float vector model. The number of hashes in our instance may not be in the 500M+ range; we would likely be closer to a few hundred thousand. For reference details on the hash, if you're curious: https://github.com/facebook/ThreatExchange/tree/main/pdq
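As a concrete illustration of the matching described above, a minimal Hamming-distance computation over 256-bit hashes stored as 32 packed bytes (the hash values here are made up):

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

h1 = bytes.fromhex("f" * 64)   # hypothetical 256-bit hash (32 bytes)
h2 = bytes.fromhex("0" * 64)
assert hamming(h1, h2) == 256  # every bit differs
```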
For me the primary value of binary vectors is that they take up less space in an index, which makes it cheaper to scale up to larger numbers of vectors, e.g. billions. That was my main concern when I asked about this years ago.
@abbottdev @hubenjm how does the recall look from your experiments operating on binary indices for your use case?
In our use case the binary value is a hash representing some file, and we would like to be able to search for "similar" files/hashes (lowest Hamming distance) within a repository of 1B files.
@vamshin (Please forgive me, I'm not from an ML background, so I don't really have any answers here.) We've not used any binary indices yet, because we discarded FAISS as an option: it didn't fit neatly into our backend stack. But it is the reference implementation used by the PDQ solution I linked to above.
I didn't really run any experiments on recall, because I abandoned binary vectors and instead used lower-dimensional float vectors (e.g. 128-dimensional). That still takes up a lot more space than 2048 binary ints, but at least there's better support for floats. I still believe this would be a very useful feature if it ever gets prioritized.
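For a rough sense of the trade-off being described, interpreting "2048 binary ints" as one 2048-bit packed vector (this counts the raw vector payload only and ignores index overhead):

```python
n = 1_000_000_000                 # one billion vectors
float_128 = n * 128 * 4           # 128-dim float32 = 512 bytes per vector
binary_2048 = n * 2048 // 8       # 2048-bit binary = 256 bytes per vector

print(f"128-dim float32: {float_128 / 1e9:.0f} GB")    # 512 GB
print(f"2048-bit binary: {binary_2048 / 1e9:.0f} GB")  # 256 GB
```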
Binary vectors are becoming very relevant these days; see https://txt.cohere.com/int8-binary-embeddings/, https://huggingface.co/blog/embedding-quantization#binary-rescoring, and https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors. It would be awesome to have this supported in OpenSearch.
@frejonb, added to the roadmap. We will have the release tagged soon.
Any updates @vamshin?
@abbottdev we are targeting this for 2.16. @shatejas is looking into it.
Is this going to work with the neural-search plugin? Is the query embedding going to be converted into a bit vector automatically?
It will work in neural-search if the model can generate the binary embedding in the correct format (packed bytes).
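A hedged sketch of producing that packed-byte format with numpy, assuming a model that emits float embeddings; sign-threshold binarization is just one common choice, and treating the field as ingesting signed int8 bytes is an assumption about the format, not something confirmed in this thread:

```python
import numpy as np

# Hypothetical 256-dim float embedding from some model.
embedding = np.random.default_rng(0).standard_normal(256).astype(np.float32)

# Binarize by sign: 1 where positive, 0 otherwise.
bits = (embedding > 0).astype(np.uint8)

# Pack 8 bits per byte: a 256-dim embedding becomes 32 bytes.
packed = np.packbits(bits)

# Reinterpret the raw bytes as signed int8 (assumption: byte-typed
# vector fields take signed 8-bit values).
doc_vector = packed.view(np.int8).tolist()
```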
Are there any plans to support Hamming distance / efficient binary vector storage when using HNSW-based k-NN? It seems the underlying nmslib supports it (nmslib/nmslib#306). This would help give parity with binary indexes in faiss: https://github.com/facebookresearch/faiss/wiki/Binary-indexes.