Add a Better Binary Quantizer format for dense vectors #13651
Closed: benwtrent wants to merge 163 commits into apache:main from benwtrent:feature/adv-binarization-format
Changes from 53 commits
163 commits
2c4cca9 iter (benwtrent)
d8f1aae iter (benwtrent)
20aa776 iter (benwtrent)
df54dde iter (benwtrent)
1b31e3e iter (benwtrent)
9d783ff iter (benwtrent)
3415d52 iter (benwtrent)
01acdf2 fleshed out a basic binary quantizer class; needs cleanup/iter (john-wagster)
1bf59f4 fleshed out a basic binary quantizer class; needs cleanup/iter (john-wagster)
dc0e2aa iter (benwtrent)
71cf39a iter (benwtrent)
938d0ad iter (benwtrent)
91cf834 bin quantizer; cleanup/iter (john-wagster)
f6e71d7 iter (benwtrent)
d84064a bin scorer; cleanup/iter (john-wagster)
b05f906 bin scorer; cleanup/iter (john-wagster)
ecdcd4f Correct errors in format reading (mayya-sharipova)
c56990e More corrections in format (mayya-sharipova)
2499263 bin scorer; cleanup/iter (john-wagster)
56b133b Better centroid re-calculation based on weighted sum (mayya-sharipova)
8f4f935 bin scorer; cleanup/iter (john-wagster)
0c4d66b Merge branch 'feature/adv-binarization-format' of github.com:benwtren… (john-wagster)
19953fc Merge branch 'main' into feature/adv-binarization-format (ChrisHegarty)
fb5faea remove export from sandbox module-info (ChrisHegarty)
c8d295b fix warnings: unused, forbidden, lint, headers, etc (ChrisHegarty)
88f0219 spotless (ChrisHegarty)
f2d2896 vectorize ipByteBin on ARM (ChrisHegarty)
5e87c1e bin scorer; cleanup/iter; merged (john-wagster)
8a9a827 format cleanup (ChrisHegarty)
2163490 Merge remote-tracking branch 'benwtrent/feature/adv-binarization-form… (ChrisHegarty)
9c6f02c bin scorer; cleanup/iter - fixed bad padding and got assertion check … (john-wagster)
11880e6 Address when number of centroids > 1 (mayya-sharipova)
831ff25 Spotless (mayya-sharipova)
bc92a2e bin scorer; cleanup/iter (john-wagster)
4993087 Merge branch 'feature/adv-binarization-format' of github.com:benwtren… (john-wagster)
37a541d bin scorer; cleanup/iter - additional fixmes and cleanup (john-wagster)
32241b8 bin scorer; cleanup/iter - setting up for tests (john-wagster)
422406a bin scorer; cleanup/iter - setting up for tests (john-wagster)
1e0c321 bin scorer; cleanup/iter - setting up for tests (john-wagster)
f4a44fe bin scorer; cleanup/iter - setting up for tests (john-wagster)
43079e0 bin scorer; cleanup/iter - got very basic euclidian tests working (john-wagster)
8dfa060 bin scorer; cleanup/iter - spotless (john-wagster)
3a16d80 test Panama and default impls of ipByteBin (ChrisHegarty)
dd8348a add boundary value test for ipByteBin (ChrisHegarty)
90febd9 more ipByteBin tests (ChrisHegarty)
5004405 bin scorer; cleanup/iter - test fixes and clean (john-wagster)
1be3f39 Testing multiple clusters (mayya-sharipova)
b4937c9 bin scorer; cleanup/iter - introduce mip throughout the reader, write… (john-wagster)
3f01539 bin scorer; cleanup/iter - introduce mip throughout the reader, write… (john-wagster)
0cbd0f8 bin scorer; cleanup/iter - introduce MIP tests (john-wagster)
680d5b0 bin scorer; cleanup/iter - introduce MIP tests (john-wagster)
a523661 bin scorer; cleanup/iter - MIP tests working (john-wagster)
1b3b7e8 panama128 minor cleanup (ChrisHegarty)
d6fc7ce Fix some errors in HNSW format (mayya-sharipova)
665c3dd Fix another error (mayya-sharipova)
abf81ef Minor test fix (mayya-sharipova)
ae429cc simplify 128 and add 256 panama impls (ChrisHegarty)
bcd7037 bin scorer; cleanup/iter - got tests working? (john-wagster)
006aa07 Make clusterID of type short, handle multiple clusters during scoring (mayya-sharipova)
6c413b3 bin scorer; cleanup/iter - added cache of target factors (john-wagster)
d762ddf bin scorer; cleanup/iter - clean up (john-wagster)
447b3df Fix error in offheap vector values (mayya-sharipova)
8eb5163 bin scorer; cleanup/iter - no lru for now (john-wagster)
8fa9a41 bin scorer; cleanup/iter - no cache and added tests (john-wagster)
6c5b980 Small modifications to tests (mayya-sharipova)
2b6a066 Addressing precommit errors (mayya-sharipova)
623ec3d Add basic documentation for build (mayya-sharipova)
82a9498 bin scorer; cleanup/iter - clean up, fixes, and added some temporary … (john-wagster)
e68a121 Merge branch 'feature/adv-binarization-format' of github.com:benwtren… (john-wagster)
23c18af Fix build failures (mayya-sharipova)
2311e0c bin scorer; cleanup/iter - minor clean up, fixes (john-wagster)
1246fba merge (john-wagster)
c426ed0 spotless (john-wagster)
5c90e00 fixed test (john-wagster)
274fdad Merge branch 'main' into feature/adv-binarization-format (ChrisHegarty)
6ac59b9 Remove default posting format override (ChrisHegarty)
b485408 spotless (ChrisHegarty)
7788699 Make default number of vectors per cluster static (mayya-sharipova)
bd22a92 Add search for Lucene912BinaryQuantizedVectorsReader (mayya-sharipova)
df3075d optimization (benwtrent)
5318f10 Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz… (benwtrent)
5967fbd bin scorer; cleanup/iter - mip fixes scores recovered (john-wagster)
8d7693a Merge branch 'feature/adv-binarization-format' of github.com:benwtren… (john-wagster)
1cb15ab more fixes (benwtrent)
93c252c Add debug information to writer (mayya-sharipova)
42e27cb adj clustering (benwtrent)
f169399 Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz… (benwtrent)
27520ba Fix error of quantizing each query vector separately (mayya-sharipova)
b782888 bin scorer; cleanup/iter - only store the set amount of corrective va… (john-wagster)
42caf1d iter (benwtrent)
cf54e64 Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz… (benwtrent)
a1f99f0 iter (benwtrent)
99f88d1 Tidying (mayya-sharipova)
e25107e Correct how query quantized vectors are accessed in the case of multi… (mayya-sharipova)
c783378 fixed how errorbounds are calculated and added mip error bounds calc (john-wagster)
4bb5de0 Merge branch 'feature/adv-binarization-format' of github.com:benwtren… (john-wagster)
8f0755e fixed small bug and test (john-wagster)
1a1144b fixed test now that corrective factors are dynamic (john-wagster)
f213606 Temprorarily comment out the test about number of vectors in cluster (mayya-sharipova)
22c61a1 Fix test with corrections (mayya-sharipova)
ca53157 adjusting centroid storage (benwtrent)
d60bb48 Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz… (benwtrent)
c918f1f fixing some tests (benwtrent)
a17f0fd reverting unnecessary change (benwtrent)
2fd2c3a more corrective factor cleanup (john-wagster)
961813b Spotless (mayya-sharipova)
594e427 Correct the test to account some wrong assignment of centroids (mayya-sharipova)
39de717 updating testbinaryquantization (john-wagster)
13520b5 merging (john-wagster)
4a899b5 spotless (john-wagster)
febf6cb Fixing scoring (benwtrent)
e5d5db5 Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz… (benwtrent)
bb706d1 store self centroid dot product alongside each centroid (tteofili)
db3d7a9 Merge branch 'feature/adv-binarization-format' of github.com:benwtren… (tteofili)
8d0a989 Add basic unit test coverage for BQVectorUtils (ChrisHegarty)
5bf8dcc fixing cosine & dp (benwtrent)
998f596 Merge remote-tracking branch 'refs/remotes/origin/feature/adv-binariz… (benwtrent)
e1ca1bf iter (benwtrent)
4f463e8 fix error correction for euclidean (benwtrent)
5e62f06 clean up (john-wagster)
eff98d7 Merge branch 'feature/adv-binarization-format' of github.com:benwtren… (john-wagster)
6ba73d8 precision (john-wagster)
93e2229 fixed tests (john-wagster)
01a2719 Fixing scoring to avoid NaN (benwtrent)
33888d3 normalize merged centroids (benwtrent)
1497e62 removing bias change (benwtrent)
c3f067f fixed a bug in the ipbytebin dims check which was bypassing panama (john-wagster)
2389442 Merge branch 'main' into feature/adv-binarization-format (john-wagster)
52d39fd fixing Search to respect updated interface (john-wagster)
31d9634 updating since Records were added (john-wagster)
2b55ca9 no-commit add more sandbox helpers (benwtrent)
5a3bbd6 fixing cosine & dimension padding handling (benwtrent)
fd8e7db Normalize vectors before clustering for COSINE similarity (mayya-sharipova)
f30ee8c Correct error (mayya-sharipova)
9f8108c Spotless (mayya-sharipova)
48a8bd0 Corrections: (mayya-sharipova)
3acf852 Fixing centroid merge (benwtrent)
4f81956 Cast to long when multiplyExact to avoid integer overflow (mayya-sharipova)
0835357 adjusting clustering limitations (benwtrent)
c0654ee fixing ip binning (benwtrent)
4bf934d set minimum to 1M vectors per cluster (benwtrent)
6e3f5f8 fixing cdotc storage etc. (benwtrent)
9e3b099 Fixing more cdotc optimizations (benwtrent)
979caf6 removing unnecessary todo comments (benwtrent)
2de38ba fixing tests (benwtrent)
dd0033d removing multiple centroid support (benwtrent)
41fce8d make merging faster (benwtrent)
f9a3fbd removing unused code (benwtrent)
183104d adjusting unused files (benwtrent)
6c1577f removing unnecessary changes and files (benwtrent)
08cd4fa Merge remote-tracking branch 'upstream/main' into feature/adv-binariz… (benwtrent)
e85736d iter (benwtrent)
714531f Merge remote-tracking branch 'upstream/main' into feature/adv-binariz… (benwtrent)
c0abb06 iter (benwtrent)
0b821a4 more clean up (benwtrent)
3b67850 we did it (benwtrent)
9fa97fb adding CHANGES (benwtrent)
1f2f41c adj changes (benwtrent)
f4bef77 fixing up docs (benwtrent)
f7b0ec0 addressing pr comments (benwtrent)
f903e00 adjusting tests (benwtrent)
12340ac Merge remote-tracking branch 'upstream/main' into feature/adv-binariz… (benwtrent)
e562aca merging in main, fixing tests (benwtrent)
72 changes: 72 additions & 0 deletions
lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java
@@ -0,0 +1,72 @@

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.codecs.lucene912;

import java.io.IOException;
import org.apache.lucene.index.ByteVectorValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.VectorScorer;

/**
 * A version of {@link ByteVectorValues}, but additionally retrieving score correction values offset
 * for binarization quantization scores.
 *
 * @lucene.experimental
 */
public abstract class BinarizedByteVectorValues extends DocIdSetIterator {
  public abstract float getDistanceToCentroid() throws IOException;

  /**
   * Returns the cluster ID for the vector in the range [-128 to 127]
   *
   * <p>Negative values should be added to 256 to get a proper cluster id.
   */
  public abstract byte clusterId() throws IOException;

  public abstract float getMagnitude() throws IOException;

  public abstract float getOOQ() throws IOException;

  public abstract float getNormOC() throws IOException;

  public abstract float getODotC() throws IOException;

  public abstract byte[] vectorValue() throws IOException;

  /** Return the dimension of the vectors */
  public abstract int dimension();

  /**
   * Return the number of vectors for this field.
   *
   * @return the number of vectors returned by this iterator
   */
  public abstract int size();

  @Override
  public final long cost() {
    return size();
  }

  /**
   * Return a {@link VectorScorer} for the given query vector.
   *
   * @param query the query vector
   * @return a {@link VectorScorer} instance or null
   */
  public abstract VectorScorer scorer(float[] query) throws IOException;
}
```
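The Javadoc on `clusterId()` says negative values should be added to 256 to get a proper cluster id. A minimal sketch of that conversion follows, assuming the caller only needs an `int` in the range 0 to 255. The class `ClusterIds` and its method names are hypothetical and not part of this PR; `Byte.toUnsignedInt` is a standard JDK method that performs the same widening.

```java
// Hypothetical helper, not part of the PR: widens the signed byte returned by
// clusterId() into an unsigned cluster id in [0, 255].
final class ClusterIds {
  private ClusterIds() {}

  /** Adds 256 to negative values, as the clusterId() Javadoc describes. */
  static int toUnsigned(byte clusterId) {
    return clusterId < 0 ? clusterId + 256 : clusterId;
  }

  /** The same conversion using the JDK's built-in helper. */
  static int toUnsignedJdk(byte clusterId) {
    return Byte.toUnsignedInt(clusterId);
  }
}
```

Both methods agree for every possible byte value, so a caller could use either form.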
38 changes: 38 additions & 0 deletions
lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinaryFlatVectorsScorer.java
@@ -0,0 +1,38 @@

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.lucene.codecs.lucene912;

import java.io.IOException;
import org.apache.lucene.codecs.hnsw.FlatVectorsScorer;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.util.hnsw.RandomVectorScorerSupplier;

public interface BinaryFlatVectorsScorer extends FlatVectorsScorer {

  /**
   * @param similarityFunction vector similarity function
   * @param scoringVectors the vectors over which to score
   * @param targetVectors the target vectors
   * @return a {@link RandomVectorScorerSupplier} that can be used to score vectors
   * @throws IOException if an I/O error occurs
   */
  RandomVectorScorerSupplier getRandomVectorScorerSupplier(
      VectorSimilarityFunction similarityFunction,
      RandomAccessBinarizedQueryByteVectorValues scoringVectors,
      RandomAccessBinarizedByteVectorValues targetVectors)
      throws IOException;
}
```
I would really like it if we just did `float[]` and the callers (which are scorers) know how to use it. The downside now is that if something calls `getODotC()` but it wasn't available, this would blow up, right? @john-wagster
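To make the `float[]` suggestion concrete, here is a rough sketch of what a single accessor could look like. Everything in it is an assumption for illustration: the interface `CorrectiveFactorsSource`, the method `getCorrectiveTerms()`, the class `MipFactors`, and the index layout are hypothetical and do not appear in the PR. The idea is that the values class exposes one array and the scorer owns the knowledge of what each slot means, so a missing factor shows up as a length mismatch instead of a call to an accessor that has nothing to return.

```java
import java.io.IOException;

// Hypothetical sketch, not the PR's API: one accessor stands in for
// getDistanceToCentroid(), getMagnitude(), getOOQ(), getNormOC() and getODotC().
interface CorrectiveFactorsSource {
  /** Returns the corrective factors for the current vector; the scorer defines the layout. */
  float[] getCorrectiveTerms() throws IOException;
}

// Hypothetical scorer-side view for the MIP case.
// The slot order {OOQ, normOC, oDotC} is an assumption made for this example.
final class MipFactors {
  static final int OOQ = 0;
  static final int NORM_OC = 1;
  static final int O_DOT_C = 2;
  static final int COUNT = 3;

  private MipFactors() {}

  /** Reads oDotC, failing with a clear message if the array has the wrong shape. */
  static float oDotC(float[] terms) {
    if (terms.length != COUNT) {
      throw new IllegalArgumentException(
          "expected " + COUNT + " corrective terms but got " + terms.length);
    }
    return terms[O_DOT_C];
  }
}
```

A scorer for a different similarity function would define its own layout class with its own expected length.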
Was naively following the pattern to get things working. This had getMagnitude and getDistanceToCentroid in it prior to adding getOOQ, NormOC, and ODotC. For MIP it only needs OOQ, NormOC, and ODotC and essentially ignores the rest (which actually overlap with the other values or are 0f). So I don't think it blows up as is, but it is storing an extra unnecessary float for Euclidean now. But I could be missing something.
Two thoughts I had and hadn't quite gotten to yet. I think I agree with your thoughts here that this should deal with arbitrary float (or byte) values. I think where we had last landed was some way to decode those values on both read and write, where the Scorer? or Format? (via reflection? to avoid versioning problems?) would supply to the Reader and Writer how to serialize and deserialize a set of bytes appropriately based on the similarity function, which would eliminate some overhead there (1 byte for Euclidean). I would buy that we could update these interfaces to deal with an array of floats in the meantime if quantization of these corrective factors is not a high priority; I was operating under the assumption that we do want to try to refactor to store a minimum overhead for each vector (2 bytes / vector in the "Euclidean" use-case and 3 bytes in the "MIP" use-case).
Second thought: I think for all of these values we want to swap them out with 1-byte quantized forms, and without having started that work yet it seems pretty straightforward in my head to do so, which is probably why I'm seeing it as part of this first pass.
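As a rough illustration of what a 1-byte quantized form of a corrective factor might look like, here is a sketch using plain linear quantization over a fixed range. The comment above does not say which scheme was intended, so the technique, the class `FactorQuantizer`, and the `min`/`max` bounds are all assumptions for this example.

```java
// Hypothetical sketch: linear 8-bit quantization of one corrective factor.
// The [min, max] range is assumed to be known up front.
final class FactorQuantizer {
  private final float min;
  private final float max;

  FactorQuantizer(float min, float max) {
    if (!(max > min)) {
      throw new IllegalArgumentException("max must be greater than min");
    }
    this.min = min;
    this.max = max;
  }

  /** Clamps the value into [min, max] and maps it onto one of 256 evenly spaced levels. */
  byte quantize(float value) {
    float clamped = Math.max(min, Math.min(max, value));
    int level = Math.round((clamped - min) / (max - min) * 255f);
    return (byte) level; // levels 128..255 wrap to negative bytes
  }

  /** Maps a stored byte back to the float at the centre of its level. */
  float dequantize(byte stored) {
    return min + (Byte.toUnsignedInt(stored) / 255f) * (max - min);
  }
}
```

With this scheme the round-trip error for an in-range value is at most about half a level, that is `(max - min) / 510`, which is the trade-off for storing one byte instead of four.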
Was thinking to do both of these things together and refactor this to return a getCorrectiveBytes instead of these individual corrective factors, putting the burden on the caller to decode them.
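A sketch of how a caller-side decode for something like `getCorrectiveBytes` could look, under stated assumptions: the enum `Similarity`, the class `CorrectiveBytes`, and the decode into the range 0 to 1 are hypothetical and chosen only to keep the example self-contained. The only detail taken from the discussion is the per-vector size, 2 bytes for Euclidean and 3 bytes for MIP.

```java
// Hypothetical sketch: names and byte layout are assumptions, not the PR's format.
enum Similarity {
  EUCLIDEAN,
  MAXIMUM_INNER_PRODUCT
}

final class CorrectiveBytes {
  private CorrectiveBytes() {}

  /** Bytes stored per vector: 2 for Euclidean, 3 for MIP, as discussed in the comment. */
  static int length(Similarity similarity) {
    return similarity == Similarity.EUCLIDEAN ? 2 : 3;
  }

  /**
   * Caller-side decode: checks the length for the similarity function, then maps each
   * stored byte to a float in [0, 1]. A real scorer would rescale to each factor's range.
   */
  static float[] decode(Similarity similarity, byte[] stored) {
    int expected = length(similarity);
    if (stored.length != expected) {
      throw new IllegalArgumentException(
          "expected " + expected + " corrective bytes but got " + stored.length);
    }
    float[] factors = new float[expected];
    for (int i = 0; i < expected; i++) {
      factors[i] = Byte.toUnsignedInt(stored[i]) / 255f;
    }
    return factors;
  }
}
```

Keeping the length check next to the decode means a reader and writer that disagree about the similarity function fail early instead of silently misreading factors.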
There is a separate place where, specifically for query (not indexing), instead of pulling through a set of corrective float[] factors we are now pulling through a QueryFactors object, which may be worth discussing further (but is a different place in the code)? I did this because we could optimize one of the floats to a short, and with the belief that we could / want to optimize the other values as well (as well as greatly improving readability). Although subsequent conversations with the team suggest that we don't care about quantizing these values (at least not as much), because the query quantization that gets serialized is only serialized to a temporary file. I digress.