LUCENE-9136: Coarse quantization that reuses existing formats. #1314

Closed
wants to merge 5 commits

Conversation

@jtibshirani (Member) commented Mar 4, 2020

Note: this PR just sketches out an idea and is not meant for detailed review.

This PR shows a kNN approach based on coarse quantization (IVFFlat). It adds a new format VectorsFormat, which simply delegates to DocValuesFormat and PostingsFormat under the hood:

  • The original vectors are stored as BinaryDocValues.
  • The vectors are also clustered, and the cluster information is stored in postings format: each cluster centroid is encoded into a BytesRef to represent a term, and each document belonging to that centroid is added to the term's postings list (a sketch of this encoding follows the list).
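
For concreteness, here is a minimal sketch of the centroid-as-term idea. It is not code from this PR; the `CentroidEncoder` class and the big-endian float layout are made up for illustration. Any layout works as long as each centroid maps to a distinct term with a consistent ordering:

```java
import java.nio.ByteBuffer;

import org.apache.lucene.util.BytesRef;

class CentroidEncoder {
  /** Packs a centroid's float values into a BytesRef so it can serve as a term. */
  static BytesRef encode(float[] centroid) {
    ByteBuffer buffer = ByteBuffer.allocate(centroid.length * Float.BYTES);
    for (float value : centroid) {
      buffer.putFloat(value);
    }
    return new BytesRef(buffer.array());
  }
}
```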

Given a query vector, we first iterate through all the centroid terms to find a small number of closest centroids. We then take the disjunction of those centroids' postings enums to obtain a DocIdSetIterator of candidate nearest neighbors. Finally, we score each candidate by loading its vector from BinaryDocValues and computing its distance to the query vector.
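
Roughly, that query-time flow could look like the sketch below. This is illustrative only, not the PR's code: the single shared field name, the helper names (`decode`, `squaredL2`), and the TreeSet standing in for a real DocIdSetIterator disjunction are all assumptions.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

/** Illustrative sketch of the query-time flow (not the PR's actual code). */
class ClusterSearch {

  /** Assumes `field` carries both the centroid terms and the BinaryDocValues vectors. */
  static PriorityQueue<ScoredDoc> search(LeafReader reader, String field, float[] query,
                                         int nProbes, int k) throws IOException {
    // 1. Scan every centroid term, keeping the nProbes closest to the query.
    TermsEnum termsEnum = reader.terms(field).iterator();
    PriorityQueue<ScoredTerm> probes =
        new PriorityQueue<>(Comparator.<ScoredTerm>comparingDouble(t -> t.dist).reversed());
    for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
      probes.add(new ScoredTerm(BytesRef.deepCopyOf(term), squaredL2(decode(term), query)));
      if (probes.size() > nProbes) {
        probes.poll(); // drop the farthest centroid seen so far
      }
    }

    // 2. Union the chosen centroids' postings into a candidate set. A sorted set
    //    stands in for the real disjunction here so that the forward-only doc
    //    values iterator below sees candidates in increasing doc order.
    TreeSet<Integer> candidates = new TreeSet<>();
    TermsEnum seekEnum = reader.terms(field).iterator();
    for (ScoredTerm probe : probes) {
      if (seekEnum.seekExact(probe.term)) {
        PostingsEnum postings = seekEnum.postings(null, PostingsEnum.NONE);
        for (int doc = postings.nextDoc();
            doc != DocIdSetIterator.NO_MORE_DOCS;
            doc = postings.nextDoc()) {
          candidates.add(doc);
        }
      }
    }

    // 3. Rescore each candidate by loading its original vector from doc values.
    BinaryDocValues vectors = reader.getBinaryDocValues(field);
    PriorityQueue<ScoredDoc> topK =
        new PriorityQueue<>(Comparator.<ScoredDoc>comparingDouble(d -> d.dist).reversed());
    for (int doc : candidates) {
      if (vectors.advanceExact(doc)) {
        topK.add(new ScoredDoc(doc, squaredL2(decode(vectors.binaryValue()), query)));
        if (topK.size() > k) {
          topK.poll();
        }
      }
    }
    return topK; // the k nearest candidates, farthest-first
  }

  static float[] decode(BytesRef bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length);
    float[] vector = new float[bytes.length / Float.BYTES];
    for (int i = 0; i < vector.length; i++) {
      vector[i] = buffer.getFloat();
    }
    return vector;
  }

  static double squaredL2(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static final class ScoredTerm {
    final BytesRef term;
    final double dist;
    ScoredTerm(BytesRef term, double dist) { this.term = term; this.dist = dist; }
  }

  static final class ScoredDoc {
    final int doc;
    final double dist;
    ScoredDoc(int doc, double dist) { this.doc = doc; this.dist = dist; }
  }
}
```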

There are currently some pretty big hacks:

  • We re-use the existing doc values and postings formats for simplicity. This is fairly fragile since we write to the same files as normal doc values and postings -- I think there would be a conflict if there were both a vector field and a doc values field with the same name.
  • To write the postings lists, we compute the map from centroid to documents in memory. We then expose it through a hacky Fields implementation called ClusterBackedFields and pass it to the postings writer. It would be better to avoid this hack and compute the cluster information without materializing a full map in memory (the sketch below shows the current in-memory step).
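
To make that second hack concrete, the in-memory step is roughly the following. Again a sketch rather than the PR's code; it reuses the hypothetical `CentroidEncoder.encode` from the earlier sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.lucene.util.BytesRef;

class ClusterMap {
  /**
   * Assigns each document to its nearest centroid, producing the
   * term -> postings map that a ClusterBackedFields-style wrapper exposes.
   */
  static SortedMap<BytesRef, List<Integer>> assign(float[][] centroids, float[][] docVectors) {
    // A TreeMap keeps the encoded centroid terms in sorted order,
    // matching the order in which the postings writer consumes terms.
    SortedMap<BytesRef, List<Integer>> clusters = new TreeMap<>();
    for (int doc = 0; doc < docVectors.length; doc++) {
      int nearest = 0;
      double nearestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        double dist = squaredL2(centroids[c], docVectors[doc]);
        if (dist < nearestDist) {
          nearestDist = dist;
          nearest = c;
        }
      }
      clusters
          .computeIfAbsent(CentroidEncoder.encode(centroids[nearest]), t -> new ArrayList<>())
          .add(doc);
    }
    return clusters;
  }

  static double squaredL2(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

Everything this builds stays on the heap until the postings writer consumes it, which is the memory concern noted above.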

@jtibshirani jtibshirani changed the title Coarse quantization A sketch of coarse quantization that reuses existing formats. Mar 4, 2020
@jtibshirani jtibshirani changed the title A sketch of coarse quantization that reuses existing formats. Sketch out coarse quantization approach that reuses existing formats. Mar 4, 2020
@jtibshirani jtibshirani changed the title Sketch out coarse quantization approach that reuses existing formats. LUCENE-9136: Coarse quantization that reuses existing formats. Mar 4, 2020
Also switch to a temp directory to avoid having to wipe the index between runs.

@jtibshirani commented Apr 3, 2020

Benchmarks

In these benchmarks, we find the nearest k=10 vectors and record recall and queries per second (QPS). For the number of centroids, we use the heuristic num_centroids = sqrt(dataset_size).
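
As a worked example of the heuristic (`numCentroids` is a hypothetical helper, not code from the PR):

```java
// num_centroids = sqrt(dataset size), so each cluster holds roughly
// sqrt(dataset size) documents: ~1,000 centroids of ~1,000 docs each for 1M vectors.
static int numCentroids(int numVectors) {
  return (int) Math.round(Math.sqrt(numVectors));
}
```

At 1M vectors, probing n_probes=10 clusters therefore rescores only about 10,000 candidates, roughly 1% of the dataset, which is where the large QPS gap against LuceneExact comes from.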

sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.

APPROACH                          RECALL     QPS
LuceneExact()                     1.000        6.425
LuceneCluster(n_probes=5)         0.756      604.133
LuceneCluster(n_probes=10)        0.874      323.791
LuceneCluster(n_probes=20)        0.951      166.580
LuceneCluster(n_probes=50)        0.993       68.465
LuceneCluster(n_probes=100)       0.999       35.139

glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.

APPROACH                          RECALL     QPS
LuceneExact()                     1.000        6.764
LuceneCluster(n_probes=5)         0.681      642.247
LuceneCluster(n_probes=10)        0.768      343.067
LuceneCluster(n_probes=20)        0.836      177.037
LuceneCluster(n_probes=50)        0.908       73.256
LuceneCluster(n_probes=100)       0.951       37.302

These benchmarks were performed using the ann-benchmarks repo. The branch and instructions for benchmarking can be found here: jtibshirani/ann-benchmarks#2.
