LUCENE-9136: Coarse quantization that reuses existing formats. #1314
Note: this PR is just meant to sketch out an idea and is not meant for detailed review.
This PR shows a kNN approach based on coarse quantization (IVFFlat). It adds a new format, `VectorsFormat`, which simply delegates to `DocValuesFormat` and `PostingsFormat` under the hood: the vectors are stored as `BinaryDocValues`, and each cluster centroid is encoded as a `BytesRef` to represent a term. Each document belonging to the centroid is added to the postings list for that term.

Given a query vector, we first iterate through all the centroid terms to find a small number of closest centroids. We then take the disjunction of all those postings enums to obtain a `DocIdSetIterator` of candidate nearest neighbors. Finally we score each candidate by loading its vector from `BinaryDocValues` and computing the distance to the query vector.
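The flow described above can be sketched in plain Java. This is only an in-memory illustration and not the code in this PR: the class `IvfFlatSketch`, its methods, and the `nProbe` and `k` parameters are hypothetical names, a `TreeMap` stands in for the postings lists, a `float[][]` stands in for the vectors kept in doc values, and squared Euclidean distance is an assumed metric.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory stand-in for the IVFFlat flow; none of these names are Lucene APIs. */
class IvfFlatSketch {
    final float[][] centroids; // one "term" per centroid
    final float[][] vectors;   // stands in for the per-document vectors in doc values
    final Map<Integer, List<Integer>> postings = new TreeMap<>(); // centroid -> doc ids

    IvfFlatSketch(float[][] centroids, float[][] vectors) {
        this.centroids = centroids;
        this.vectors = vectors;
        // Index time: add each document to the postings list of its nearest centroid.
        for (int doc = 0; doc < vectors.length; doc++) {
            int c = nearestCentroids(vectors[doc], 1).get(0);
            postings.computeIfAbsent(c, key -> new ArrayList<>()).add(doc);
        }
    }

    // Assumed metric: squared Euclidean distance.
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Step 1: scan every centroid and keep the nProbe closest to the query. */
    List<Integer> nearestCentroids(float[] query, int nProbe) {
        List<Integer> ids = new ArrayList<>();
        for (int c = 0; c < centroids.length; c++) {
            ids.add(c);
        }
        ids.sort(Comparator.comparingDouble(c -> squaredDistance(query, centroids[c])));
        return ids.subList(0, Math.min(nProbe, ids.size()));
    }

    /** Returns up to k doc ids, closest first. */
    List<Integer> search(float[] query, int nProbe, int k) {
        // Step 2: the disjunction (union) of the selected postings lists gives the candidates.
        TreeSet<Integer> candidates = new TreeSet<>();
        for (int c : nearestCentroids(query, nProbe)) {
            candidates.addAll(postings.getOrDefault(c, List.of()));
        }
        // Step 3: score each candidate against the query using its stored vector.
        List<Integer> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble(doc -> squaredDistance(query, vectors[doc])));
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }
}
```

In this sketch, a larger `nProbe` visits more postings lists, which trades query speed for recall; documents whose centroid is not among the `nProbe` closest are never scored.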
There are currently some pretty big hacks:
To write the postings, we wrap the cluster information in a `Fields` implementation called `ClusterBackedFields` and pass it to the postings writer. It would be better to avoid this hack and not to compute cluster information using a map.