-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
quantize: add rabitq quantization #134370
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just a few nits.
Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)
pkg/sql/vecindex/quantize/rabitq.go
line 25 at r1 (raw file):
// "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound // for Approximate Nearest Neighbor Search" by Jianyang Gao & Cheng Long. // URL: https://arxiv.org/pdf/2405.12497
Add this URL to the commit message. Maybe consider committing the paper? We don't know what the future availability of it will be.
pkg/sql/vecindex/quantize/rabitq.go
line 36 at r1 (raw file):
// vector index. This is important, because the ROT matrix is expensive to // generate and can use quite a bit of memory. type raBitQuantizer struct {
I find that comments describing non-obvious struct fields to be very high value. I normally don't like code with a lot of inline comments, but I'm never sad about well documented structs.
pkg/sql/vecindex/quantize/rabitq.go
line 384 at r1 (raw file):
codeBits |= getSignBit(elements[7]) dim += 8
Why is this not part of the for loop decl?
pkg/sql/vecindex/quantize/rabitqpb.go
line 67 at r1 (raw file):
// set to defined values before use. func (cs *RaBitQCodeSet) AddUndefined(count int) { cs.Data = slices.Grow(cs.Data, count*cs.Width)
Maybe I'm overthinking things, but these Grow()/append() operations seem like they could be expensive. Did you consider a either a list or a dictionary of fixed-size pages?
pkg/sql/vecindex/quantize/rabitq_test.go
line 83 at r1 (raw file):
// Edge cases. func TestRaBitQuantizerEdge(t *testing.T) {
Is there any value in testing large vectors (2k-ish elements)?
pkg/sql/vecindex/quantize/rabitq_test.go
line 274 at r1 (raw file):
func loadFeatures(t testing.TB) vector.Set { f, err := os.Open("../features_10000.gob")
Since features is an opaque binary, maybe include some information about what this is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mw5h)
pkg/sql/vecindex/quantize/rabitq.go
line 25 at r1 (raw file):
Previously, mw5h (Matt White) wrote…
Add this URL to the commit message. Maybe consider committing the paper? We don't know what the future availability of it will be.
Done
pkg/sql/vecindex/quantize/rabitq.go
line 36 at r1 (raw file):
Previously, mw5h (Matt White) wrote…
I find that comments describing non-obvious struct fields to be very high value. I normally don't like code with a lot of inline comments, but I'm never sad about well documented structs.
Done.
pkg/sql/vecindex/quantize/rabitq.go
line 384 at r1 (raw file):
Previously, mw5h (Matt White) wrote…
Why is this not part of the for loop decl?
Just to avoid adding 8 twice. I went ahead and changed it, though.
pkg/sql/vecindex/quantize/rabitqpb.go
line 67 at r1 (raw file):
Previously, mw5h (Matt White) wrote…
Maybe I'm overthinking things, but these Grow()/append() operations seem like they could be expensive. Did you consider a either a list or a dictionary of fixed-size pages?
I think slices are the simplest and clearest. However, if perf testing shows that some other data structure out-performs slices, we could switch to them. I didn't see any evidence of this being a perf bottleneck on profiles, though.
pkg/sql/vecindex/quantize/rabitq_test.go
line 83 at r1 (raw file):
Previously, mw5h (Matt White) wrote…
Is there any value in testing large vectors (2k-ish elements)?
I do test 512 byte vectors below. Given how the algorithm works, I don't think there's much additional value to testing even more than that in unit tests. As we do more perf testing, though, we will want to test higher-dimensionality vectors.
pkg/sql/vecindex/quantize/rabitq_test.go
line 274 at r1 (raw file):
Previously, mw5h (Matt White) wrote…
Since features is an opaque binary, maybe include some information about what this is?
I added a descriptive comment.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This commit implements RaBitQ quantization according to the algorithm described in this paper: "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search" by Jianyang Gao & Cheng Long. URL: https://arxiv.org/pdf/2405.12497 The RaBitQ quantization method provides good accuracy, produces compact codes, provides practical error bounds, is easy to implement, and can be accelerated with fast SIMD instructions. RaBitQ quantization codes use only 1 bit per dimension in the original vector. Epic: CRDB-42943 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)
bors r=mw5h |
This commit implements RaBitQ quantization according to the algorithm described in this paper:
The RaBitQ quantization method provides good accuracy, produces compact codes, provides practical error bounds, is easy to implement, and can be accelerated with fast SIMD instructions. RaBitQ quantization codes use only 1 bit per dimension in the original vector.
Epic: CRDB-42943
Release note: None