Subvector clustering for approximate L2 similarity #377
-
Hi Alex, I was looking into this area recently and happened to see your plugin for ES. This algorithm is also (or more widely) known as product quantization. For very large datasets, you want to consider the inverted file with product quantization (IVF-PQ) approach; this and similar algorithms are implemented in https://github.com/facebookresearch/faiss for reference.
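For reference, here's a minimal IVF-PQ sketch with faiss; the dataset and parameter values are illustrative assumptions, not anything from this thread:

```python
import numpy as np
import faiss  # https://github.com/facebookresearch/faiss

d = 128        # vector dimensionality
nlist = 1024   # number of inverted lists (coarse k-means clusters)
m = 16         # number of subvectors for product quantization
nbits = 8      # bits per subvector code -> 256 centroids per subquantizer

xb = np.random.rand(100_000, d).astype("float32")  # database vectors
xq = np.random.rand(10, d).astype("float32")       # query vectors

coarse = faiss.IndexFlatL2(d)                   # exact coarse quantizer
index = faiss.IndexIVFPQ(coarse, d, nlist, m, nbits)
index.train(xb)                                 # learn coarse + PQ codebooks
index.add(xb)
index.nprobe = 8                                # inverted lists to probe per query
distances, ids = index.search(xq, 10)           # approximate top-10 by L2
```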
-
Hi, thanks! I figured they were related, but I read the FAISS paper a long time ago and hadn't had a chance to come back to it. Let me know if you get a chance to use the plugin and have any feedback.
-
+1 for this. I have a production use case that I would be interested in trying this on.
-
I'm curious how this would be different from using L2 LSH, other than perhaps having different speed vs. recall characteristics? To be clear, the subvector clustering algorithm would not produce clusters of vectors. That is, it does not satisfy the use case of "give me n clusters of all my vectors." Instead, it uses a clustering technique internally to do approximate nearest neighbor search. It would most likely have the same or a very similar API as the LSH methods.
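A quick sketch of that terminology distinction, using scikit-learn's k-means purely for illustration (the plugin itself doesn't use scikit-learn, and the sizes here are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2_000, 64)).astype(np.float32)

# Clustering in the "give me n clusters of all my vectors" sense:
# partition the corpus into groups. This is NOT what the feature does.
groups = KMeans(n_clusters=10, n_init=10).fit_predict(vectors)

# Subvector clustering uses k-means only as a quantizer: a subvector is
# mapped to its nearest centroid's ID, and those discrete IDs become
# index terms for approximate nearest neighbor search.
quantizer = KMeans(n_clusters=256, n_init=10).fit(vectors[:, :16])
codes = quantizer.predict(vectors[:, :16])  # index terms, not groupings
```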
-
Ahh, my apologies, I misread this as "give me n clusters of all my vectors." Do you have any advice on how one might do that, or is it on your roadmap?
-
I see, yeah, it's overloaded terminology. I think clustering vectors would be technically possible but very slow. Clustering generally requires keeping all data directly accessible in memory, and ES/Lucene aren't really designed for that. Also, it wouldn't really fit the request/response model of ES.
-
Subvector clustering is introduced in "Towards Practical Visual Search Engine Within Elasticsearch" as a method for discretizing dense floating point vectors to approximate L2 similarity with great results.
The method works roughly as follows:
- Split each dense vector into m contiguous subvectors.
- Run k-means over each subvector position across the corpus to learn k centroids per position.
- Encode each vector as the IDs of its m nearest centroids, which can be indexed as discrete tokens.
- At query time, encode the query the same way, retrieve candidates by matching tokens, and optionally re-rank them by exact L2 distance.
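Here's a minimal sketch of the discretization step using numpy and scikit-learn; the function names and default parameter values are my own assumptions for illustration, not the paper's or the plugin's:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_subvector_codebooks(X, m=8, k=256, seed=0):
    """Split d-dim vectors into m subvectors and fit one k-means
    codebook per subvector position."""
    n, d = X.shape
    assert d % m == 0, "dims must divide evenly into m subvectors"
    sub = d // m
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed)
        .fit(X[:, i * sub : (i + 1) * sub])
        for i in range(m)
    ]

def encode(X, codebooks):
    """Map each vector to m centroid IDs; each (position, ID) pair
    can be indexed as a discrete token, e.g. "3_142"."""
    sub = X.shape[1] // len(codebooks)
    ids = [
        cb.predict(X[:, i * sub : (i + 1) * sub])
        for i, cb in enumerate(codebooks)
    ]
    return np.stack(ids, axis=1)  # shape (n, m)
```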
I think there are two potential approaches to make this work in Elastiknn:
- Using a mapping: index the discretized tokens as a sparse bool vector with the sparse_indexed model (see the sketch after this list), but you could possibly specify another model like Hamming LSH to speed it up even more.
- Using a pipeline processor: run the discretization in an ingest pipeline so vectors are encoded into tokens at indexing time.
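To make the first approach concrete, here is a sketch of how the discretized codes could be flattened into the true indices of a sparse boolean vector; the offset scheme (one one-hot block of size k per subvector) is my assumption, not something specified by the paper or the plugin:

```python
def codes_to_sparse_indices(codes_row, k=256):
    # codes_row holds m centroid IDs, one per subvector position.
    # Subvector i's centroid c maps to true index i * k + c, giving a
    # sparse boolean vector of length m * k with exactly m true entries.
    return [i * k + int(c) for i, c in enumerate(codes_row)]

# e.g. codes_row = [17, 3, 201] -> true indices [17, 259, 713], which
# could be stored in a sparse bool vector field using the sparse_indexed
# (or Hamming LSH) model.
```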