LUCENE-9136: Coarse quantization that reuses existing formats. #1314

Closed
wants to merge 5 commits

Conversation

@jtibshirani (Member) commented Mar 4, 2020

Note: this PR just sketches out an idea and is not meant for detailed review.

This PR shows a kNN approach based on coarse quantization (IVFFlat). It adds a new format VectorsFormat, which simply delegates to DocValuesFormat and PostingsFormat under the hood:

  • The original vectors are stored as BinaryDocValues.
  • The vectors are also clustered, and the cluster information is stored in postings format: each cluster centroid is encoded into a BytesRef to represent a term, and each document belonging to that centroid is added to the term's postings list (a sketch of this encoding follows the list).
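
For concreteness, here is a minimal sketch of the centroid-as-term idea. It is not code from this PR; the `CentroidEncoder` class and the big-endian float layout are made up for illustration. Any layout works as long as each centroid maps to a distinct term with a consistent ordering:

```java
import java.nio.ByteBuffer;

import org.apache.lucene.util.BytesRef;

class CentroidEncoder {
  /** Packs a centroid's float values into a BytesRef so it can serve as a term. */
  static BytesRef encode(float[] centroid) {
    ByteBuffer buffer = ByteBuffer.allocate(centroid.length * Float.BYTES);
    for (float value : centroid) {
      buffer.putFloat(value);
    }
    return new BytesRef(buffer.array());
  }
}
```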

Given a query vector, we first iterate through all the centroid terms to find a small number of closest centroids. We then take the disjunction of those centroids' postings enums to obtain a DocIdSetIterator of candidate nearest neighbors. Finally, we score each candidate by loading its vector from BinaryDocValues and computing its distance to the query vector.
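
Roughly, that query-time flow could look like the sketch below. This is illustrative only, not the PR's code: the single shared field name, the helper names (`decode`, `squaredL2`), and the TreeSet standing in for a real DocIdSetIterator disjunction are all assumptions.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

/** Illustrative sketch of the query-time flow (not the PR's actual code). */
class ClusterSearch {

  /** Assumes `field` carries both the centroid terms and the BinaryDocValues vectors. */
  static PriorityQueue<ScoredDoc> search(LeafReader reader, String field, float[] query,
                                         int nProbes, int k) throws IOException {
    // 1. Scan every centroid term, keeping the nProbes closest to the query.
    TermsEnum termsEnum = reader.terms(field).iterator();
    PriorityQueue<ScoredTerm> probes =
        new PriorityQueue<>(Comparator.<ScoredTerm>comparingDouble(t -> t.dist).reversed());
    for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
      probes.add(new ScoredTerm(BytesRef.deepCopyOf(term), squaredL2(decode(term), query)));
      if (probes.size() > nProbes) {
        probes.poll(); // drop the farthest centroid seen so far
      }
    }

    // 2. Union the chosen centroids' postings into a candidate set. A sorted set
    //    stands in for the real disjunction here so that the forward-only doc
    //    values iterator below sees candidates in increasing doc order.
    TreeSet<Integer> candidates = new TreeSet<>();
    TermsEnum seekEnum = reader.terms(field).iterator();
    for (ScoredTerm probe : probes) {
      if (seekEnum.seekExact(probe.term)) {
        PostingsEnum postings = seekEnum.postings(null, PostingsEnum.NONE);
        for (int doc = postings.nextDoc();
            doc != DocIdSetIterator.NO_MORE_DOCS;
            doc = postings.nextDoc()) {
          candidates.add(doc);
        }
      }
    }

    // 3. Rescore each candidate by loading its original vector from doc values.
    BinaryDocValues vectors = reader.getBinaryDocValues(field);
    PriorityQueue<ScoredDoc> topK =
        new PriorityQueue<>(Comparator.<ScoredDoc>comparingDouble(d -> d.dist).reversed());
    for (int doc : candidates) {
      if (vectors.advanceExact(doc)) {
        topK.add(new ScoredDoc(doc, squaredL2(decode(vectors.binaryValue()), query)));
        if (topK.size() > k) {
          topK.poll();
        }
      }
    }
    return topK; // the k nearest candidates, farthest-first
  }

  static float[] decode(BytesRef bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length);
    float[] vector = new float[bytes.length / Float.BYTES];
    for (int i = 0; i < vector.length; i++) {
      vector[i] = buffer.getFloat();
    }
    return vector;
  }

  static double squaredL2(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static final class ScoredTerm {
    final BytesRef term;
    final double dist;
    ScoredTerm(BytesRef term, double dist) { this.term = term; this.dist = dist; }
  }

  static final class ScoredDoc {
    final int doc;
    final double dist;
    ScoredDoc(int doc, double dist) { this.doc = doc; this.dist = dist; }
  }
}
```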

There are currently some pretty big hacks:

  • We re-use the existing doc values and postings formats for simplicity. This is fairly fragile since we write to the same files as normal doc values and postings -- I think there would be a conflict if there were both a vector field and a doc values field with the same name.
  • To write the postings lists, we compute the map from centroid to documents in memory. We then expose it through a hacky Fields implementation called ClusterBackedFields and pass it to the postings writer. It would be better to avoid this hack and compute the cluster information without materializing a full map in memory (the sketch below shows the current in-memory step).
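
To make that second hack concrete, the in-memory step is roughly the following. Again a sketch rather than the PR's code; it reuses the hypothetical `CentroidEncoder.encode` from the earlier sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.lucene.util.BytesRef;

class ClusterMap {
  /**
   * Assigns each document to its nearest centroid, producing the
   * term -> postings map that a ClusterBackedFields-style wrapper exposes.
   */
  static SortedMap<BytesRef, List<Integer>> assign(float[][] centroids, float[][] docVectors) {
    // A TreeMap keeps the encoded centroid terms in sorted order,
    // matching the order in which the postings writer consumes terms.
    SortedMap<BytesRef, List<Integer>> clusters = new TreeMap<>();
    for (int doc = 0; doc < docVectors.length; doc++) {
      int nearest = 0;
      double nearestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        double dist = squaredL2(centroids[c], docVectors[doc]);
        if (dist < nearestDist) {
          nearestDist = dist;
          nearest = c;
        }
      }
      clusters
          .computeIfAbsent(CentroidEncoder.encode(centroids[nearest]), t -> new ArrayList<>())
          .add(doc);
    }
    return clusters;
  }

  static double squaredL2(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

Everything this builds stays on the heap until the postings writer consumes it, which is the memory concern noted above.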

@jtibshirani jtibshirani changed the title Coarse quantization A sketch of coarse quantization that reuses existing formats. Mar 4, 2020
@jtibshirani jtibshirani changed the title A sketch of coarse quantization that reuses existing formats. Sketch out coarse quantization approach that reuses existing formats. Mar 4, 2020
@jtibshirani jtibshirani changed the title Sketch out coarse quantization approach that reuses existing formats. LUCENE-9136: Coarse quantization that reuses existing formats. Mar 4, 2020
Also switch to a temp directory to avoid having to wipe the index between runs.

@jtibshirani commented Apr 3, 2020

Benchmarks

In these benchmarks, we find the nearest k=10 vectors and record recall and queries per second (QPS). For the number of centroids, we use the heuristic num_centroids = sqrt(dataset_size).
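
As a worked example of the heuristic (`numCentroids` is a hypothetical helper, not code from the PR):

```java
// num_centroids = sqrt(dataset size), so each cluster holds roughly
// sqrt(dataset size) documents: ~1,000 centroids of ~1,000 docs each for 1M vectors.
static int numCentroids(int numVectors) {
  return (int) Math.round(Math.sqrt(numVectors));
}
```

At 1M vectors, probing n_probes=10 clusters therefore rescores only about 10,000 candidates, roughly 1% of the dataset, which is where the large QPS gap against LuceneExact comes from.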

sift-128-euclidean: a dataset of 1 million SIFT descriptors with 128 dims.

APPROACH                          RECALL     QPS
LuceneExact()                     1.000        6.425
LuceneCluster(n_probes=5)         0.756      604.133
LuceneCluster(n_probes=10)        0.874      323.791
LuceneCluster(n_probes=20)        0.951      166.580
LuceneCluster(n_probes=50)        0.993       68.465
LuceneCluster(n_probes=100)       0.999       35.139

glove-100-angular: a dataset of ~1.2 million GloVe word vectors of 100 dims.

APPROACH                          RECALL     QPS
LuceneExact()                     1.000        6.764
LuceneCluster(n_probes=5)         0.681      642.247
LuceneCluster(n_probes=10)        0.768      343.067
LuceneCluster(n_probes=20)        0.836      177.037
LuceneCluster(n_probes=50)        0.908       73.256
LuceneCluster(n_probes=100)       0.951       37.302

These benchmarks were performed using the ann-benchmarks repo. The branch and instructions for benchmarking can be found here: jtibshirani/ann-benchmarks#2.
