Add a Better Binary Quantizer format for dense vectors #13651

Closed
wants to merge 163 commits into from
Changes from 53 commits
Commits
2c4cca9
iter
benwtrent Aug 12, 2024
d8f1aae
iter
benwtrent Aug 12, 2024
20aa776
iter
benwtrent Aug 12, 2024
df54dde
iter
benwtrent Aug 13, 2024
1b31e3e
iter
benwtrent Aug 13, 2024
9d783ff
iter
benwtrent Aug 13, 2024
3415d52
iter
benwtrent Aug 13, 2024
01acdf2
fleshed out a basic binary quantizer class; needs cleanup/iter
john-wagster Aug 13, 2024
1bf59f4
fleshed out a basic binary quantizer class; needs cleanup/iter
john-wagster Aug 14, 2024
dc0e2aa
iter
benwtrent Aug 14, 2024
71cf39a
iter
benwtrent Aug 14, 2024
938d0ad
iter
benwtrent Aug 14, 2024
91cf834
bin quantizer; cleanup/iter
john-wagster Aug 14, 2024
f6e71d7
iter
benwtrent Aug 15, 2024
d84064a
bin scorer; cleanup/iter
john-wagster Aug 15, 2024
b05f906
bin scorer; cleanup/iter
john-wagster Aug 16, 2024
ecdcd4f
Correct errors in format reading
mayya-sharipova Aug 19, 2024
c56990e
More corrections in format
mayya-sharipova Aug 20, 2024
2499263
bin scorer; cleanup/iter
john-wagster Aug 20, 2024
56b133b
Better centroid re-calculation based on weighted sum
mayya-sharipova Aug 21, 2024
8f4f935
bin scorer; cleanup/iter
john-wagster Aug 22, 2024
0c4d66b
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Aug 22, 2024
19953fc
Merge branch 'main' into feature/adv-binarization-format
ChrisHegarty Aug 22, 2024
fb5faea
remove export from sandbox module-info
ChrisHegarty Aug 22, 2024
c8d295b
fix warnings: unused, forbidden, lint, headers, etc
ChrisHegarty Aug 22, 2024
88f0219
spotless
ChrisHegarty Aug 22, 2024
f2d2896
vectorize ipByteBin on ARM
ChrisHegarty Aug 22, 2024
5e87c1e
bin scorer; cleanup/iter; merged
john-wagster Aug 22, 2024
8a9a827
format cleanup
ChrisHegarty Aug 22, 2024
2163490
Merge remote-tracking branch 'benwtrent/feature/adv-binarization-form…
ChrisHegarty Aug 22, 2024
9c6f02c
bin scorer; cleanup/iter - fixed bad padding and got assertion check …
john-wagster Aug 22, 2024
11880e6
Address when number of centroids > 1
mayya-sharipova Aug 22, 2024
831ff25
Spotless
mayya-sharipova Aug 22, 2024
bc92a2e
bin scorer; cleanup/iter
john-wagster Aug 22, 2024
4993087
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Aug 22, 2024
37a541d
bin scorer; cleanup/iter - additional fixmes and cleanup
john-wagster Aug 22, 2024
32241b8
bin scorer; cleanup/iter - setting up for tests
john-wagster Aug 22, 2024
422406a
bin scorer; cleanup/iter - setting up for tests
john-wagster Aug 23, 2024
1e0c321
bin scorer; cleanup/iter - setting up for tests
john-wagster Aug 23, 2024
f4a44fe
bin scorer; cleanup/iter - setting up for tests
john-wagster Aug 23, 2024
43079e0
bin scorer; cleanup/iter - got very basic Euclidean tests working
john-wagster Aug 23, 2024
8dfa060
bin scorer; cleanup/iter - spotless
john-wagster Aug 23, 2024
3a16d80
test Panama and default impls of ipByteBin
ChrisHegarty Aug 23, 2024
dd8348a
add boundary value test for ipByteBin
ChrisHegarty Aug 23, 2024
90febd9
more ipByteBin tests
ChrisHegarty Aug 23, 2024
5004405
bin scorer; cleanup/iter - test fixes and clean
john-wagster Aug 23, 2024
1be3f39
Testing multiple clusters
mayya-sharipova Aug 23, 2024
b4937c9
bin scorer; cleanup/iter - introduce mip throughout the reader, write…
john-wagster Aug 24, 2024
3f01539
bin scorer; cleanup/iter - introduce mip throughout the reader, write…
john-wagster Aug 24, 2024
0cbd0f8
bin scorer; cleanup/iter - introduce MIP tests
john-wagster Aug 24, 2024
680d5b0
bin scorer; cleanup/iter - introduce MIP tests
john-wagster Aug 24, 2024
a523661
bin scorer; cleanup/iter - MIP tests working
john-wagster Aug 24, 2024
1b3b7e8
panama128 minor cleanup
ChrisHegarty Aug 26, 2024
d6fc7ce
Fix some errors in HNSW format
mayya-sharipova Aug 26, 2024
665c3dd
Fix another error
mayya-sharipova Aug 26, 2024
abf81ef
Minor test fix
mayya-sharipova Aug 26, 2024
ae429cc
simplify 128 and add 256 panama impls
ChrisHegarty Aug 27, 2024
bcd7037
bin scorer; cleanup/iter - got tests working?
john-wagster Aug 27, 2024
006aa07
Make clusterID of type short, handle multiple clusters during scoring
mayya-sharipova Aug 27, 2024
6c413b3
bin scorer; cleanup/iter - added cache of target factors
john-wagster Aug 28, 2024
d762ddf
bin scorer; cleanup/iter - clean up
john-wagster Aug 28, 2024
447b3df
Fix error in offheap vector values
mayya-sharipova Aug 28, 2024
8eb5163
bin scorer; cleanup/iter - no lru for now
john-wagster Aug 28, 2024
8fa9a41
bin scorer; cleanup/iter - no cache and added tests
john-wagster Aug 29, 2024
6c5b980
Small modifications to tests
mayya-sharipova Aug 29, 2024
2b6a066
Addressing precommit errors
mayya-sharipova Aug 29, 2024
623ec3d
Add basic documentation for build
mayya-sharipova Aug 29, 2024
82a9498
bin scorer; cleanup/iter - clean up, fixes, and added some temporary …
john-wagster Aug 30, 2024
e68a121
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Aug 30, 2024
23c18af
Fix build failures
mayya-sharipova Aug 30, 2024
2311e0c
bin scorer; cleanup/iter - minor clean up, fixes
john-wagster Sep 1, 2024
1246fba
merge
john-wagster Sep 1, 2024
c426ed0
spotless
john-wagster Sep 1, 2024
5c90e00
fixed test
john-wagster Sep 1, 2024
274fdad
Merge branch 'main' into feature/adv-binarization-format
ChrisHegarty Sep 3, 2024
6ac59b9
Remove default posting format override
ChrisHegarty Sep 3, 2024
b485408
spotless
ChrisHegarty Sep 3, 2024
7788699
Make default number of vectors per cluster static
mayya-sharipova Sep 3, 2024
bd22a92
Add search for Lucene912BinaryQuantizedVectorsReader
mayya-sharipova Sep 3, 2024
df3075d
optimization
benwtrent Sep 3, 2024
5318f10
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 3, 2024
5967fbd
bin scorer; cleanup/iter - mip fixes scores recovered
john-wagster Sep 4, 2024
8d7693a
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Sep 4, 2024
1cb15ab
more fixes
benwtrent Sep 4, 2024
93c252c
Add debug information to writer
mayya-sharipova Sep 4, 2024
42e27cb
adj clustering
benwtrent Sep 4, 2024
f169399
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 4, 2024
27520ba
Fix error of quantizing each query vector separately
mayya-sharipova Sep 4, 2024
b782888
bin scorer; cleanup/iter - only store the set amount of corrective va…
john-wagster Sep 4, 2024
42caf1d
iter
benwtrent Sep 5, 2024
cf54e64
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 5, 2024
a1f99f0
iter
benwtrent Sep 5, 2024
99f88d1
Tidying
mayya-sharipova Sep 5, 2024
e25107e
Correct how query quantized vectors are accessed in the case of multi…
mayya-sharipova Sep 5, 2024
c783378
fixed how errorbounds are calculated and added mip error bounds calc
john-wagster Sep 5, 2024
4bb5de0
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Sep 5, 2024
8f0755e
fixed small bug and test
john-wagster Sep 5, 2024
1a1144b
fixed test now that corrective factors are dynamic
john-wagster Sep 5, 2024
f213606
Temporarily comment out the test about number of vectors in cluster
mayya-sharipova Sep 5, 2024
22c61a1
Fix test with corrections
mayya-sharipova Sep 5, 2024
ca53157
adjusting centroid storage
benwtrent Sep 5, 2024
d60bb48
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 5, 2024
c918f1f
fixing some tests
benwtrent Sep 5, 2024
a17f0fd
reverting unnecessary change
benwtrent Sep 5, 2024
2fd2c3a
more corrective factor cleanup
john-wagster Sep 5, 2024
961813b
Spotless
mayya-sharipova Sep 6, 2024
594e427
Correct the test to account some wrong assignment of centroids
mayya-sharipova Sep 6, 2024
39de717
updating testbinaryquantization
john-wagster Sep 6, 2024
13520b5
merging
john-wagster Sep 6, 2024
4a899b5
spotless
john-wagster Sep 6, 2024
febf6cb
Fixing scoring
benwtrent Sep 6, 2024
e5d5db5
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 6, 2024
bb706d1
store self centroid dot product alongside each centroid
tteofili Sep 9, 2024
db3d7a9
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
tteofili Sep 9, 2024
8d0a989
Add basic unit test coverage for BQVectorUtils
ChrisHegarty Sep 9, 2024
5bf8dcc
fixing cosine & dp
benwtrent Sep 9, 2024
998f596
Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz…
benwtrent Sep 9, 2024
e1ca1bf
iter
benwtrent Sep 9, 2024
4f463e8
fix error correction for euclidean
benwtrent Sep 9, 2024
5e62f06
clean up
john-wagster Sep 9, 2024
eff98d7
Merge branch 'feature/adv-binarization-format' of github.com:benwtren…
john-wagster Sep 9, 2024
6ba73d8
precision
john-wagster Sep 9, 2024
93e2229
fixed tests
john-wagster Sep 9, 2024
01a2719
Fixing scoring to avoid NaN
benwtrent Sep 9, 2024
33888d3
normalize merged centroids
benwtrent Sep 9, 2024
1497e62
removing bias change
benwtrent Sep 9, 2024
c3f067f
fixed a bug in the ipbytebin dims check which was bypassing panama
john-wagster Sep 11, 2024
2389442
Merge branch 'main' into feature/adv-binarization-format
john-wagster Sep 11, 2024
52d39fd
fixing Search to respect updated interface
john-wagster Sep 11, 2024
31d9634
updating since Records were added
john-wagster Sep 11, 2024
2b55ca9
no-commit add more sandbox helpers
benwtrent Sep 12, 2024
5a3bbd6
fixing cosine & dimension padding handling
benwtrent Sep 12, 2024
fd8e7db
Normalize vectors before clustering for COSINE similarity
mayya-sharipova Sep 17, 2024
f30ee8c
Correct error
mayya-sharipova Sep 17, 2024
9f8108c
Spotless
mayya-sharipova Sep 17, 2024
48a8bd0
Corrections:
mayya-sharipova Sep 17, 2024
3acf852
Fixing centroid merge
benwtrent Sep 17, 2024
4f81956
Cast to long when multiplyExact to avoid integer overflow
mayya-sharipova Sep 18, 2024
0835357
adjusting clustering limitations
benwtrent Sep 19, 2024
c0654ee
fixing ip binning
benwtrent Sep 19, 2024
4bf934d
set minimum to 1M vectors per cluster
benwtrent Sep 19, 2024
6e3f5f8
fixing cdotc storage etc.
benwtrent Sep 20, 2024
9e3b099
Fixing more cdotc optimizations
benwtrent Sep 20, 2024
979caf6
removing unnecessary todo comments
benwtrent Sep 20, 2024
2de38ba
fixing tests
benwtrent Sep 20, 2024
dd0033d
removing multiple centroid support
benwtrent Sep 23, 2024
41fce8d
make merging faster
benwtrent Sep 24, 2024
f9a3fbd
removing unused code
benwtrent Sep 25, 2024
183104d
adjusting unused files
benwtrent Oct 16, 2024
6c1577f
removing unnecessary changes and files
benwtrent Oct 16, 2024
08cd4fa
Merge remote-tracking branch 'upstream/main' into feature/adv-binariz…
benwtrent Oct 16, 2024
e85736d
iter
benwtrent Oct 17, 2024
714531f
Merge remote-tracking branch 'upstream/main' into feature/adv-binariz…
benwtrent Oct 17, 2024
c0abb06
iter
benwtrent Oct 17, 2024
0b821a4
more clean up
benwtrent Oct 17, 2024
3b67850
we did it
benwtrent Oct 18, 2024
9fa97fb
adding CHANGES
benwtrent Oct 18, 2024
1f2f41c
adj changes
benwtrent Oct 18, 2024
f4bef77
fixing up docs
benwtrent Oct 18, 2024
f7b0ec0
addressing pr comments
benwtrent Oct 22, 2024
f903e00
adjusting tests
benwtrent Oct 22, 2024
12340ac
Merge remote-tracking branch 'upstream/main' into feature/adv-binariz…
benwtrent Nov 6, 2024
e562aca
merging in main, fixing tests
benwtrent Nov 6, 2024
@@ -156,7 +156,12 @@ private static String pad(String input) {

/** Process all the JFR files passed in args and print a merged summary. */
public static void printReport(
- List<String> files, String mode, int stacksize, int count, boolean lineNumbers, boolean frameTypes)
+ List<String> files,
+ String mode,
+ int stacksize,
+ int count,
+ boolean lineNumbers,
+ boolean frameTypes)
throws IOException {
if (!"cpu".equals(mode) && !"heap".equals(mode)) {
throw new IllegalArgumentException("tests.profile.mode must be one of (cpu,heap)");
1 change: 1 addition & 0 deletions gradle/validation/spotless.gradle
@@ -36,6 +36,7 @@ configure(allprojects) { prj ->

lineEndings 'UNIX'
endWithNewline()
toggleOffOn()
googleJavaFormat(deps.versions.googleJavaFormat.get())

// Apply to all Java sources
4 changes: 3 additions & 1 deletion lucene/core/src/java/module-info.java
@@ -78,7 +78,9 @@
provides org.apache.lucene.codecs.KnnVectorsFormat with
org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat,
org.apache.lucene.codecs.lucene99.Lucene99HnswScalarQuantizedVectorsFormat,
- org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat;
+ org.apache.lucene.codecs.lucene99.Lucene99ScalarQuantizedVectorsFormat,
+ org.apache.lucene.codecs.lucene912.Lucene912BinaryQuantizedVectorsFormat,
+ org.apache.lucene.codecs.lucene912.Lucene912HnswBinaryQuantizedVectorsFormat;
provides org.apache.lucene.codecs.PostingsFormat with
org.apache.lucene.codecs.lucene912.Lucene912PostingsFormat;
provides org.apache.lucene.index.SortFieldProvider with
@@ -0,0 +1,72 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.codecs.lucene912;

import java.io.IOException;
import org.apache.lucene.index.ByteVectorValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.VectorScorer;

/**
* A version of {@link ByteVectorValues}, but additionally retrieving score correction values offset
* for binarization quantization scores.
*
* @lucene.experimental
*/
public abstract class BinarizedByteVectorValues extends DocIdSetIterator {
public abstract float getDistanceToCentroid() throws IOException;

/**
* Returns the cluster ID for the vector in the range [-128 to 127]
*
* <p>Negative values should be added to 256 to get a proper cluster id.
*/
public abstract byte clusterId() throws IOException;

public abstract float getMagnitude() throws IOException;

public abstract float getOOQ() throws IOException;

public abstract float getNormOC() throws IOException;

public abstract float getODotC() throws IOException;
@benwtrent (Member, Author) Sep 3, 2024

I would really like it if we just used a float[] and let the callers (which are scorers) know how to use it. The downside of the current approach is that if something calls getODotC() when that value isn't available, this would blow up, right?

@john-wagster

I was naively following the existing pattern to get things working. This class already had getMagnitude and getDistanceToCentroid before I added getOOQ, getNormOC, and getODotC. MIP only needs OOQ, NormOC, and ODotC and essentially ignores the rest (they actually overlap with the other values or are 0f). So I don't think it blows up as is, but it is now storing an extra, unnecessary float for Euclidean. I could be missing something, though.

Two thoughts I had and hadn't quite gotten to yet. First, I think I agree with you that this should deal with arbitrary float (or byte) values. I think where we last landed was some way to encode and decode those values on both write and read: the Scorer (or the Format, perhaps via reflection to avoid versioning problems?) would tell the Reader and Writer how to serialize and deserialize a set of bytes appropriately for the similarity function, which would eliminate some overhead (1 byte for Euclidean). I would buy that we could update these interfaces to deal with an array of floats in the meantime, depending on whether quantizing these corrective factors is a high priority. I have been operating under the assumption that we do want to refactor to store the minimum overhead for each vector (2 bytes per vector in the "Euclidean" case and 3 bytes in the "MIP" case).

Second, I think we want to swap all of these values out for 1-byte quantized forms. I haven't started that work yet, but it seems pretty straightforward in my head, which is probably why I'm seeing it as part of this first pass.

I was thinking of doing both of these things together: refactor this to return a getCorrectiveBytes instead of these individual corrective factors, and put the burden on the caller to decode them.

There is a separate place, specifically for queries (not indexing), where instead of pulling through a set of corrective float[] factors we now pull through a QueryFactors object, which may be worth discussing further (but it is a different place in the code). I did this because we could optimize one of the floats to a short, and in the belief that we could and would want to optimize the other values as well (it also greatly improves readability). However, subsequent conversations with the team suggest that we don't care about quantizing these values (at least not as much), because the quantized query is only serialized to a temporary file. I digress.
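
For illustration only, the float[] option raised in this thread could look roughly like the sketch below. It is not code from this PR. The class name, the `Similarity` enum (a stand-in for Lucene's `VectorSimilarityFunction`), the helper names, and the array layouts are all assumptions. Only the counts (2 values for Euclidean, 3 for MIP) and the factor names come from the discussion above.

```java
/** Hypothetical sketch, not code from this PR: corrective factors carried as one float[]. */
final class CorrectiveFactorsSketch {

  /** Stand-in for Lucene's VectorSimilarityFunction, limited to the two cases in the thread. */
  enum Similarity {
    EUCLIDEAN,
    MAXIMUM_INNER_PRODUCT
  }

  /** Corrective values per vector: 2 for Euclidean, 3 for MIP, as stated in the thread. */
  static int factorCount(Similarity similarity) {
    return similarity == Similarity.EUCLIDEAN ? 2 : 3;
  }

  /** Assumed layout for Euclidean: [distanceToCentroid, magnitude]. */
  static float[] euclideanFactors(float distanceToCentroid, float magnitude) {
    return new float[] {distanceToCentroid, magnitude};
  }

  /** Assumed layout for MIP: [ooq, normOC, oDotC]. */
  static float[] mipFactors(float ooq, float normOC, float oDotC) {
    return new float[] {ooq, normOC, oDotC};
  }

  /** Called by a scorer before decoding, so a mismatch fails early with a clear message. */
  static float[] requireFactors(Similarity similarity, float[] factors) {
    int expected = factorCount(similarity);
    if (factors.length != expected) {
      throw new IllegalArgumentException(
          similarity + " expects " + expected + " corrective factors but got " + factors.length);
    }
    return factors;
  }
}
```

With this shape the values class exposes a single accessor, and a scorer that asks for factors its similarity does not store fails with an explicit length check instead of reading an unrelated value.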


public abstract byte[] vectorValue() throws IOException;

/** Return the dimension of the vectors */
public abstract int dimension();

/**
* Return the number of vectors for this field.
*
* @return the number of vectors returned by this iterator
*/
public abstract int size();

@Override
public final long cost() {
return size();
}

/**
* Return a {@link VectorScorer} for the given query vector.
*
* @param query the query vector
* @return a {@link VectorScorer} instance or null
*/
public abstract VectorScorer scorer(float[] query) throws IOException;
}
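
The `clusterId()` Javadoc above says the stored byte lies in [-128, 127] and that negative values should be added to 256. A minimal sketch of that conversion follows. The class and method names are hypothetical and are not part of this PR. `Byte.toUnsignedInt` is the standard JDK call that performs the same mapping.

```java
/** Hypothetical helper, not part of this PR: signed byte storage for cluster ids 0..255. */
final class ClusterIdSketch {

  /** Maps the stored byte in [-128, 127] to a cluster id in [0, 255]. */
  static int toClusterId(byte stored) {
    // Equivalent to (stored & 0xFF), or to stored + 256 when stored is negative.
    return Byte.toUnsignedInt(stored);
  }

  /** Inverse mapping: narrows a cluster id in [0, 255] to the byte that would be stored. */
  static byte toStoredByte(int clusterId) {
    if (clusterId < 0 || clusterId > 255) {
      throw new IllegalArgumentException("cluster id out of range [0, 255]: " + clusterId);
    }
    return (byte) clusterId;
  }
}
```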
@@ -0,0 +1,38 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.lucene.codecs.lucene912;

import java.io.IOException;
import org.apache.lucene.codecs.hnsw.FlatVectorsScorer;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;

public interface BinaryFlatVectorsScorer extends FlatVectorsScorer {

/**
* @param similarityFunction vector similarity function
* @param scoringVectors the vectors over which to score
* @param targetVectors the target vectors
* @return a {@link RandomVectorScorerSupplier} that can be used to score vectors
* @throws IOException if an I/O error occurs
*/
RandomVectorScorerSupplier getRandomVectorScorerSupplier(
VectorSimilarityFunction similarityFunction,
RandomAccessBinarizedQueryByteVectorValues scoringVectors,
RandomAccessBinarizedByteVectorValues targetVectors)
throws IOException;
}
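
The interface above only declares how a scorer supplier is obtained; the scoring kernel itself is not part of the hunks shown here. As a generic illustration of why bit-packed vectors are cheap to score, the sketch below computes the dot product of two {0,1} vectors with a bitwise AND and a population count. This is a textbook technique under assumed names (`BinaryDotProductSketch`, `pack`, `dot`) and an assumed most-significant-bit-first packing. It is not the PR's scorer, which also applies the corrective factors discussed earlier.

```java
/** Generic illustration, not the PR's scorer: dot product of two bit-packed {0,1} vectors. */
final class BinaryDotProductSketch {

  /** Packs a {0,1} vector into bytes, most significant bit first (an assumed layout). */
  static byte[] pack(int[] bits) {
    byte[] packed = new byte[(bits.length + 7) / 8];
    for (int i = 0; i < bits.length; i++) {
      if (bits[i] != 0) {
        packed[i >> 3] |= (byte) (1 << (7 - (i & 7)));
      }
    }
    return packed;
  }

  /** Counts the positions where both vectors have a 1 bit. */
  static int dot(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("length mismatch: " + a.length + " vs " + b.length);
    }
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension of negative bytes does not add spurious ones.
      sum += Integer.bitCount((a[i] & b[i]) & 0xFF);
    }
    return sum;
  }
}
```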