Subvector clustering for approximate L2 similarity #377

alexklibisz · 2020-04-20T16:33:52Z

alexklibisz
Apr 20, 2020
Maintainer

Subvector clustering is introduced in Towards Practical Visual Search Engine Within Elasticsearch as a method for discretizing dense floating point vectors to approximate L2 similarity with great results.

The method works roughly as follows:

Assume d-dimensional floating point vectors.
Each vector is broken into m sub-vectors.
Each sub-vector is clustered into one of k clusters. (k clusters for every sub-vector "region", so m * k total clusters).
The vector is encoded as tuples (sub-vector index, cluster index). These tuples can be "unrolled" into a sparse bool vector with m * k entries.

I think there are two potential approaches to make this work in Elastiknn

Using a mapping:

Define another model called "subvector_clustering" which works with floating point vectors for L2 similarity, e.g.:

{
  "type": "elastiknn_dense_float_vector",
  "elastiknn": {
    "dims": n,
    "model": "subvector_clustering",
    "similarity": "l2",
    "m": 32,
    "k": 16,
    # If you don't provide sample vectors, maybe generate them randomly?
    # It's an interesting question whether this can work totally random vectors defining the clusters.
    "sample_vectors": [ { "values": [0.1, 0.2, ...] }, ... ],

    # Not sure if this part makes sense, but perhaps you would want to specify how your sparse
    # bool vectors are indexed/searched?
    "secondary_model": "sparse_indexed",
    "secondary_model": "lsh",
    "similarity": "hamming",
    "bits": 99
  }
}

Indexing works the same as current models: use the cluster centroids to generate a set of terms encoding (subvector index, cluster index), store those terms just like lsh hashes.
Search works the same as current models: retrieve mapping, generate cluster centroids from the sample_vectors or based on a random seed, generate a set of terms encoding (subvector index, clsuter index), use that in a term query.
The default model for indexing is effectively sparse_indexed, but you could possibly specify another model like Hamming LSH to speed it up even more.

Using a pipeline processor:

This was my first-take idea but after writing it out I think the mapping approach is better and easier.
Create a pipeline processor with parameters: m, k, and a sample of >= k vectors.
Use the sample of vectors to "learn" the clusters for each sub-vector and only store the m * k cluster centroids.
Processor takes a vector, assigns each of its subvectors to a cluster, and encodes the assignments as a sparse bool vector.
The problem is: how do you process a query vector? You would have to remember which pipeline processor you used and somehow reference it in the query.

zhoutong-fu · 2020-08-19T02:13:36Z

zhoutong-fu
Aug 19, 2020

Hi Alex, I was looking into this area recently and happened to see your plugin for ES. This algorithm is also (or more widely) known as product quantization. For very large datasets, you want to consider inverted file with product quantization and similar algorithms are implemented in https://github.com/facebookresearch/faiss for reference.

0 replies

alexklibisz · 2020-08-19T02:52:39Z

alexklibisz
Aug 19, 2020
Maintainer Author

Hi Alex, I was looking into this area recently and happened to see your plugin for ES. This algorithm is also (or more widely) known as product quantization. For very large datasets, you want to consider inverted file with product quantization and similar algorithms are implemented in https://github.com/facebookresearch/faiss for reference.

Hi, thanks! I figured they were related but I read the FAISS paper a long time ago and hadn't had a chance to come back to it. Let me know if you get a chance to use the plugin and have any feedback.

0 replies

bennimmo · 2021-05-07T13:43:56Z

bennimmo
May 7, 2021

+1 for this. I have a production use case I would be interested in trying.

0 replies

alexklibisz · 2021-05-07T14:39:48Z

alexklibisz
May 7, 2021
Maintainer Author

+1 for this. I have a production use case I would be interested in trying.

I'm curious how this would be different than using L2 LSH, other than perhaps having different speed vs. recall characteristics?

To be clear, the subvector clustering algorithm would not provide clusters of vectors. As in, it does not satisfy the usecase of "give me n clusters of all my vectors." Instead, it's using a clustering technique to do approximate nearest neighbor search. It would most likely have the same or very similar API as the LSH methods.

0 replies

bennimmo · 2021-05-07T15:28:11Z

bennimmo
May 7, 2021

Ahh my apologies, I miss read and saw this as "give me n clusters of all my vectors".

Do you have any advice on how one might do that or if it's on your roadmap?

0 replies

alexklibisz · 2021-05-07T21:27:38Z

alexklibisz
May 7, 2021
Maintainer Author

I see, yeah it's overloaded terminology. I think clustering vectors would be technically possible but very slow. Clustering generally requires keeping all data directly accessible in memory and ES/Lucene aren't really designed for that. Also it wouldn't really fit the request/response model of ES.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subvector clustering for approximate L2 similarity #377

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Subvector clustering for approximate L2 similarity #377

alexklibisz Apr 20, 2020 Maintainer

Replies: 6 comments

zhoutong-fu Aug 19, 2020

alexklibisz Aug 19, 2020 Maintainer Author

bennimmo May 7, 2021

alexklibisz May 7, 2021 Maintainer Author

bennimmo May 7, 2021

alexklibisz May 7, 2021 Maintainer Author

alexklibisz
Apr 20, 2020
Maintainer

zhoutong-fu
Aug 19, 2020

alexklibisz
Aug 19, 2020
Maintainer Author

bennimmo
May 7, 2021

alexklibisz
May 7, 2021
Maintainer Author

bennimmo
May 7, 2021

alexklibisz
May 7, 2021
Maintainer Author