
[FEATURE] Hamming distance / binary vector support #81

Closed
jaredbrowarnik opened this issue Aug 18, 2021 · 23 comments
Labels: Features, k-NN, v2.16.0


@jaredbrowarnik

Are there any plans to support Hamming distance / efficient binary vector storage when using HNSW-based KNN? It seems like the underlying nmslib has support for it (nmslib/nmslib#306). This would help give parity with binary indexes in faiss: https://github.com/facebookresearch/faiss/wiki/Binary-indexes.

@jmazanec15
Member

Hi @jaredbrowarnik

At the moment, we do not have a plan to add binary index support. Currently, we are working on faiss support (#27).

Adding binary vector support would be a big project. We would probably need to create a new field type.

That being said, to the community: please +1 this issue if you would like to see the feature in the plugin.

@jaredbrowarnik
Author

Got it, thanks for the feedback @jmazanec15.

So would the added faiss support include support for their binary indexes (e.g. IndexBinaryHNSW)?

@jmazanec15
Member

@jaredbrowarnik no, faiss support will not include it; that would be another project. At a high level, I think we might need to:

  1. create a new data type, similar to knn_vector, but for binary data (see the rough sketch after this list)
  2. enhance our existing codec/jni to support binary indices as well
  3. add another query type or enhance the existing one we have
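
For a sense of what that could look like end to end, here is a purely hypothetical sketch of those three steps from a user's point of view. None of these values (`data_type: "binary"`, `space_type: "hamming"`, the index and field names) are a committed API; they only illustrate the shape of the idea.

```python
# Hypothetical sketch only: what a binary knn_vector mapping *might* look like.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="binary-demo",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",         # step 1: a binary flavor of knn_vector
                    "dimension": 256,             # number of bits
                    "data_type": "binary",        # hypothetical new data type
                    "method": {
                        "name": "hnsw",           # step 2: codec/JNI builds a binary index
                        "engine": "faiss",
                        "space_type": "hamming",  # step 3: hamming as a query space
                    },
                }
            }
        },
    },
)
```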

@hubenjm

hubenjm commented Nov 16, 2021

Would very much appreciate this support, since dense float vectors present a much bigger challenge when trying to scale to 10 billion documents.

I have also had a difficult time figuring out whether Elasticsearch 7.15 supports a bit_hamming space for binary vectors or an equivalent (e.g. a base64-encoded string), and whether the Script Score k-NN approach would even be feasible with that many documents (see https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/#getting-started-with-the-score-script-for-binary-data).
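
For concreteness, the score-script approach in that doc boils down to something like the sketch below (index and field names invented; `query_value` is the base64-encoded query hash). The match_all inner query means every document gets scored, which is exactly the scaling worry:

```python
# Sketch of the exact (brute-force) k-NN score-script approach from the doc above.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

results = client.search(
    index="hashes",  # hypothetical index with a `binary` field (doc_values enabled)
    body={
        "size": 10,
        "query": {
            "script_score": {
                "query": {"match_all": {}},  # scores every doc: O(n) per query
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "my_binary",
                        "query_value": "SGVsbG8gd29ybGQhISEh",  # base64 query hash
                        "space_type": "hammingbit",
                    },
                },
            }
        },
    },
)
```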

Any thoughts on the above?

@hubenjm

hubenjm commented Nov 18, 2021

+1

@prems1891

+1

@paragor

paragor commented Aug 24, 2022

+1

jmazanec15 added the Features label Oct 5, 2022
@TaeWoo21

+1

@gilamsalem

+1

@abbottdev

As faiss support has been implemented (#70), can we use approximate Hamming distance on binary types yet?

Faiss has a BinaryIndex type (https://github.com/facebookresearch/faiss/wiki/Binary-indexes), so I don't think it needs to be a separate project; we should just be able to use hamming as a space argument when performing a k-NN query?

@jmazanec15
Member

Hi @abbottdev, yes, I think faiss and nmslib support binary indices; we could leverage them. Could you describe your use case a little more: what problem space are you using this for? In #949 you mentioned you have 500M+ 64-bit binary vectors. How did you decide to use binary over float representations?

@abbottdev

abbottdev commented Aug 17, 2023

@jmazanec15 - My use case, I think, may be a little different from the OP's. For us, the goal is to use a binary vector database to index PDQ image hashes. Because this is a perceptual-hash problem, hashes that are "closer" are more relevant, and for the algorithm in question closeness is the Hamming distance between two 256-bit hashes. So this isn't strictly a standard ML/float vector model. The number of hashes in our instance may not be in the 500M+ range; we would likely be closer to a few hundred thousand.

For reference, details on the hash if you're curious: https://github.com/facebook/ThreatExchange/tree/main/pdq
See also page 11 of https://github.com/facebook/ThreatExchange/blob/main/hashing/hashing.pdf
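
For concreteness, a PDQ match decision reduces to a popcount of the XOR of two 256-bit hashes; a minimal pure-Python sketch (production systems would use vectorized popcounts instead):

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length bit strings."""
    assert len(a) == len(b)
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

h1 = bytes.fromhex("f" * 64)        # 256 bits, all ones (32 bytes)
h2 = bytes.fromhex("f" * 63 + "0")  # same hash with the last 4 bits flipped
print(hamming(h1, h2))              # -> 4; PDQ matches against a threshold
```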

@hubenjm

hubenjm commented Aug 17, 2023

For me, the primary value of binary vectors is that they take up less space in an index, which makes it cheaper to scale up to larger numbers of vectors, e.g. billions. That was my main concern when I asked about this years ago.

@vamshin
Member

vamshin commented Aug 17, 2023

@abbottdev @hubenjm how does the recall look from your experiments operating on binary indices for your use case?

@gilamsalem

gilamsalem commented Sep 10, 2023

In our use case the binary value is a hash representing some file, and we would like to be able to search for "similar" files/hashes (lowest Hamming distance) within a repository of 1B files.
We tested the exact search option, which supports Hamming distance on binary values; it worked for a small number of hashes but doesn't really scale to larger numbers.

@abbottdev

@vamshin (Please forgive me, I'm not from an ML background, so I don't really have any answers here.) We haven't used any binary indices yet, because we discarded the option of using FAISS when it didn't fit neatly into our backend stack - but it is the reference implementation used by the PDQ solution I linked above.

@hubenjm

hubenjm commented Dec 9, 2023

> @abbottdev @hubenjm how does the recall look from your experiments operating on binary indices for your use case?

I didn't really run any experiments on recall, because I abandoned binary vectors and instead used lower-dimensional float vectors (e.g. 128-dimensional). That still takes up a lot more space than 2048 binary ints, but at least there's better support for floats. I still believe this would be a very useful feature if it ever gets prioritized.
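
For a rough sense of the size tradeoff being described (raw vector payload only, ignoring HNSW graph and other index overhead):

```python
# Back-of-envelope storage at the billion-document scale discussed above.
n = 1_000_000_000
float_128d  = n * 128 * 4    # 128-dim float32 -> 512 bytes per vector
binary_2048 = n * 2048 // 8  # 2048-bit binary -> 256 bytes per vector
print(f"{float_128d / 2**30:.0f} GiB vs {binary_2048 / 2**30:.0f} GiB")
# -> "477 GiB vs 238 GiB": even the reduced float setup is 2x the binary one
```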

@frejonb

frejonb commented Mar 27, 2024

Binary vectors are becoming very relevant these days; see https://txt.cohere.com/int8-binary-embeddings/, https://huggingface.co/blog/embedding-quantization#binary-rescoring, and https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors. It would be awesome to have this supported in OpenSearch.
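
The binary quantization those posts describe is essentially sign-thresholding each float dimension and packing the bits; a minimal numpy sketch:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """float32 (n, d) -> packed uint8 (n, d // 8); d must be divisible by 8."""
    bits = (embeddings > 0).astype(np.uint8)  # keep only the sign of each dim
    return np.packbits(bits, axis=1)          # 8 bits per byte

emb = np.random.randn(4, 1024).astype(np.float32)  # 4096 bytes per vector
packed = binarize(emb)                             # 128 bytes per vector
print(emb.nbytes, "->", packed.nbytes)             # 16384 -> 512 (32x smaller)
```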

@vamshin vamshin moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Apr 1, 2024
@vamshin
Member

vamshin commented Apr 1, 2024

@frejonb, added to roadmap. We will have the release tagged soon.

@abbottdev

Any updates @vamshin?

@vamshin
Member

vamshin commented May 20, 2024

@abbottdev we are targeting this for 2.16. @shatejas is looking into it.

@vamshin vamshin moved this from Backlog (Hot) to 2.15.0 in Vector Search RoadMap May 31, 2024
@asfoorial

asfoorial commented Jul 18, 2024

Is this going to work with the neural-search plugin? Is the query embedding going to be converted into a bit vector automatically?

@heemin32
Collaborator

heemin32 commented Jul 18, 2024

> Is this going to work with the neural-search plugin? Is the query embedding going to be converted into a bit vector automatically?

It will work with neural-search if the model can generate the binary embedding in the correct format (packed bytes).
Automatic quantization to binary vectors will come in #1779.
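
For illustration, a sketch of that packed-byte format, assuming the MSB-first bit order that np.packbits uses (the field name is hypothetical; check the k-NN docs for the exact bit-order convention):

```python
import numpy as np

# A 16-bit binary embedding packs into two signed int8 values, 8 bits each.
bits = np.array([0, 1, 1, 0, 0, 0, 1, 1,   # 0b01100011 ->  99
                 1, 0, 0, 0, 0, 0, 0, 1])  # 0b10000001 -> -127 as int8
packed = np.packbits(bits).astype(np.int8)
print(packed.tolist())                     # -> [99, -127]
doc = {"my_vector": packed.tolist()}       # hypothetical field name
```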

@github-project-automation github-project-automation bot moved this from 2.16.0 to ✅ Done in Vector Search RoadMap Aug 9, 2024
@github-project-automation github-project-automation bot moved this to 2.16 (First RC 07/23, Release 08/06) in OpenSearch Project Roadmap Aug 30, 2024