
[FEATURE] Hamming distance / binary vector support #81

Closed
jaredbrowarnik opened this issue Aug 18, 2021 · 23 comments
Labels: Features, k-NN, v2.16.0


@jaredbrowarnik

Are there any plans to support Hamming distance / efficient binary vector storage when using HNSW-based KNN? It seems like the underlying nmslib has support for it (nmslib/nmslib#306). This would help give parity with binary indexes in faiss: https://github.com/facebookresearch/faiss/wiki/Binary-indexes.

@jmazanec15
Member

Hi @jaredbrowarnik

At the moment, we do not have a plan to add binary index support. Currently, we are working on faiss support (#27).

Adding binary vector support would be a big project. We would probably need to create a new field type.

That being said, to the community: please +1 this issue if you would like to see the feature in the plugin.

@jaredbrowarnik
Author

Got it, thanks for the feedback @jmazanec15.

So would the added faiss support include support for their binary indexes (e.g. IndexBinaryHNSW)?

@jmazanec15
Member

@jaredbrowarnik no, faiss support will not include it; that would be another project. At a high level, I think we might need to:

  1. create a new data type, similar to knn_vector, but for binary data (see the rough sketch after this list)
  2. enhance our existing codec/jni to support binary indices as well
  3. add another query type or enhance the existing one we have
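
For a sense of what that could look like end to end, here is a purely hypothetical sketch of those three steps from a user's point of view. None of these values (`data_type: "binary"`, `space_type: "hamming"`, the index and field names) are a committed API; they only illustrate the shape of the idea.

```python
# Hypothetical sketch only: what a binary knn_vector mapping *might* look like.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="binary-demo",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",         # step 1: a binary flavor of knn_vector
                    "dimension": 256,             # number of bits
                    "data_type": "binary",        # hypothetical new data type
                    "method": {
                        "name": "hnsw",           # step 2: codec/JNI builds a binary index
                        "engine": "faiss",
                        "space_type": "hamming",  # step 3: hamming as a query space
                    },
                }
            }
        },
    },
)
```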

@hubenjm

hubenjm commented Nov 16, 2021

Would very much appreciate this support, since dense float vectors present a much bigger challenge when trying to scale to 10 billion documents.

I have also had a difficult time figuring out whether Elasticsearch 7.15 supports a bit_hamming space for binary vectors or an equivalent (e.g. a base64-encoded string), and whether the Script Score k-NN approach would even be feasible with that many documents (see https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/#getting-started-with-the-score-script-for-binary-data).
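
For concreteness, the score-script approach in that doc boils down to something like the sketch below (index and field names invented; `query_value` is the base64-encoded query hash). The match_all inner query means every document gets scored, which is exactly the scaling worry:

```python
# Sketch of the exact (brute-force) k-NN score-script approach from the doc above.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

results = client.search(
    index="hashes",  # hypothetical index with a `binary` field (doc_values enabled)
    body={
        "size": 10,
        "query": {
            "script_score": {
                "query": {"match_all": {}},  # scores every doc: O(n) per query
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "my_binary",
                        "query_value": "SGVsbG8gd29ybGQhISEh",  # base64 query hash
                        "space_type": "hammingbit",
                    },
                },
            }
        },
    },
)
```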

Any thoughts on the above?

@hubenjm

hubenjm commented Nov 18, 2021

+1

@prems1891

+1

@paragor

paragor commented Aug 24, 2022

+1

jmazanec15 added the Features label Oct 5, 2022
@TaeWoo21

+1

@gilamsalem

+1

@abbottdev

As faiss support has been implemented (#70), can we use approximate Hamming distance on binary types yet?

Faiss has a BinaryIndex type (https://github.com/facebookresearch/faiss/wiki/Binary-indexes), so I don't think it needs to be a separate project; we should just be able to use hamming as a space argument when performing a k-NN query?

@jmazanec15
Member

Hi @abbottdev, yes, I think faiss and nmslib support binary indices; we could leverage them. Could you describe your use case a little more: what problem space are you using this for? In #949 you mentioned you have 500M+ 64-bit binary vectors. How did you decide to use binary over float representations?

@abbottdev

abbottdev commented Aug 17, 2023

@jmazanec15 - My use case, I think, may be a little different from the OP's. For us, the goal is to use a binary vector database to index PDQ image hashes. Because this is a perceptual-hash problem, hashes that are "closer" are more relevant, and for the algorithm in question closeness is the Hamming distance between two 256-bit hashes. So this isn't strictly a standard ML/float vector model. The number of hashes in our instance may not be in the 500M+ range; we would likely be closer to a few hundred thousand.

For reference, details on the hash if you're curious: https://github.com/facebook/ThreatExchange/tree/main/pdq
See also page 11 of https://github.com/facebook/ThreatExchange/blob/main/hashing/hashing.pdf
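
For concreteness, a PDQ match decision reduces to a popcount of the XOR of two 256-bit hashes; a minimal pure-Python sketch (production systems would use vectorized popcounts instead):

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length bit strings."""
    assert len(a) == len(b)
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

h1 = bytes.fromhex("f" * 64)        # 256 bits, all ones (32 bytes)
h2 = bytes.fromhex("f" * 63 + "0")  # same hash with the last 4 bits flipped
print(hamming(h1, h2))              # -> 4; PDQ matches against a threshold
```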

@hubenjm

hubenjm commented Aug 17, 2023

For me, the primary value of binary vectors is that they take up less space in an index, which makes it cheaper to scale up to larger numbers of vectors, e.g. billions. That was my main concern when I asked about this years ago.

@vamshin
Member

vamshin commented Aug 17, 2023

@abbottdev @hubenjm how does the recall look from your experiments operating on binary indices for your use case?

@gilamsalem

gilamsalem commented Sep 10, 2023

In our use case the binary value is a hash representing some file, and we would like to be able to search for "similar" files/hashes (lowest Hamming distance) within a repository of 1B files.
We tested the exact search option, which supports Hamming distance on binary values; it worked for a small number of hashes but doesn't really scale to larger numbers.

@abbottdev

@vamshin (Please forgive me, I'm not from an ML background, so I don't really have any answers here.) We haven't used any binary indices yet, because we discarded the option of using FAISS when it didn't fit neatly into our backend stack - but it is the reference implementation used by the PDQ solution I linked above.

@hubenjm

hubenjm commented Dec 9, 2023

> @abbottdev @hubenjm how does the recall look from your experiments operating on binary indices for your use case?

I didn't really run any experiments on recall, because I abandoned binary vectors and instead used lower-dimensional float vectors (e.g. 128-dimensional). That still takes up a lot more space than 2048 binary ints, but at least there's better support for floats. I still believe this would be a very useful feature if it ever gets prioritized.
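
For a rough sense of the size tradeoff being described (raw vector payload only, ignoring HNSW graph and other index overhead):

```python
# Back-of-envelope storage at the billion-document scale discussed above.
n = 1_000_000_000
float_128d  = n * 128 * 4    # 128-dim float32 -> 512 bytes per vector
binary_2048 = n * 2048 // 8  # 2048-bit binary -> 256 bytes per vector
print(f"{float_128d / 2**30:.0f} GiB vs {binary_2048 / 2**30:.0f} GiB")
# -> "477 GiB vs 238 GiB": even the reduced float setup is 2x the binary one
```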

@frejonb

frejonb commented Mar 27, 2024

Binary vectors are becoming very relevant these days; see https://txt.cohere.com/int8-binary-embeddings/, https://huggingface.co/blog/embedding-quantization#binary-rescoring, and https://blog.pgvecto.rs/my-binary-vector-search-is-better-than-your-fp32-vectors. It would be awesome to have this supported in OpenSearch.
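
The binary quantization those posts describe is essentially sign-thresholding each float dimension and packing the bits; a minimal numpy sketch:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """float32 (n, d) -> packed uint8 (n, d // 8); d must be divisible by 8."""
    bits = (embeddings > 0).astype(np.uint8)  # keep only the sign of each dim
    return np.packbits(bits, axis=1)          # 8 bits per byte

emb = np.random.randn(4, 1024).astype(np.float32)  # 4096 bytes per vector
packed = binarize(emb)                             # 128 bytes per vector
print(emb.nbytes, "->", packed.nbytes)             # 16384 -> 512 (32x smaller)
```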

@vamshin vamshin moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Apr 1, 2024
@vamshin
Member

vamshin commented Apr 1, 2024

@frejonb, added to roadmap. We will have the release tagged soon.

@abbottdev

Any updates @vamshin?

@vamshin
Member

vamshin commented May 20, 2024

@abbottdev we are targeting this for 2.16. @shatejas is looking into it.

@vamshin vamshin moved this from Backlog (Hot) to 2.15.0 in Vector Search RoadMap May 31, 2024
@asfoorial

asfoorial commented Jul 18, 2024

Is this going to work with the neural-search plugin? Is the query embedding going to be converted into a bit vector automatically?

@heemin32
Collaborator

heemin32 commented Jul 18, 2024

> Is this going to work with the neural-search plugin? Is the query embedding going to be converted into a bit vector automatically?

It will work with neural-search if the model can generate the binary embedding in the correct format (packed bytes).
Automatic quantization to binary vectors will come in #1779.
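
For illustration, a sketch of that packed-byte format, assuming the MSB-first bit order that np.packbits uses (the field name is hypothetical; check the k-NN docs for the exact bit-order convention):

```python
import numpy as np

# A 16-bit binary embedding packs into two signed int8 values, 8 bits each.
bits = np.array([0, 1, 1, 0, 0, 0, 1, 1,   # 0b01100011 ->  99
                 1, 0, 0, 0, 0, 0, 0, 1])  # 0b10000001 -> -127 as int8
packed = np.packbits(bits).astype(np.int8)
print(packed.tolist())                     # -> [99, -127]
doc = {"my_vector": packed.tolist()}       # hypothetical field name
```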

@github-project-automation github-project-automation bot moved this from 2.16.0 to ✅ Done in Vector Search RoadMap Aug 9, 2024
@github-project-automation github-project-automation bot moved this to 2.16 (First RC 07/23, Release 08/06) in OpenSearch Project Roadmap Aug 30, 2024