New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

quantize: add rabitq quantization #134370

Merged

craig merged 1 commit into cockroachdb:master from andy-kimball:rabitq

Nov 7, 2024

Contributor

andy-kimball commented Nov 6, 2024

This commit implements RaBitQ quantization according to the algorithm described in this paper:

"RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound
for Approximate Nearest Neighbor Search" by Jianyang Gao & Cheng Long.

The RaBitQ quantization method provides good accuracy, produces compact codes, provides practical error bounds, is easy to implement, and can be accelerated with fast SIMD instructions. RaBitQ quantization codes use only 1 bit per dimension in the original vector.

Epic: CRDB-42943

Release note: None

andy-kimball requested review from DrewKimball and mw5h

November 6, 2024 04:51

Member

cockroach-teamcity commented Nov 6, 2024

This change is

mw5h requested changes

View reviewed changes

Contributor

mw5h left a comment

Looks good. Just a few nits.

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)

pkg/sql/vecindex/quantize/rabitq.go line 25 at r1 (raw file):

//	"RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound
//	for Approximate Nearest Neighbor Search" by Jianyang Gao & Cheng Long.
//	URL: https://arxiv.org/pdf/2405.12497

Add this URL to the commit message. Maybe consider committing the paper? We don't know what the future availability of it will be.

pkg/sql/vecindex/quantize/rabitq.go line 36 at r1 (raw file):

// vector index. This is important, because the ROT matrix is expensive to
// generate and can use quite a bit of memory.
type raBitQuantizer struct {

I find that comments describing non-obvious struct fields to be very high value. I normally don't like code with a lot of inline comments, but I'm never sad about well documented structs.

pkg/sql/vecindex/quantize/rabitq.go line 384 at r1 (raw file):

			codeBits |= getSignBit(elements[7])

			dim += 8

Why is this not part of the for loop decl?

pkg/sql/vecindex/quantize/rabitqpb.go line 67 at r1 (raw file):

// set to defined values before use.
func (cs *RaBitQCodeSet) AddUndefined(count int) {
	cs.Data = slices.Grow(cs.Data, count*cs.Width)

Maybe I'm overthinking things, but these Grow()/append() operations seem like they could be expensive. Did you consider a either a list or a dictionary of fixed-size pages?

pkg/sql/vecindex/quantize/rabitq_test.go line 83 at r1 (raw file):

// Edge cases.
func TestRaBitQuantizerEdge(t *testing.T) {

Is there any value in testing large vectors (2k-ish elements)?

pkg/sql/vecindex/quantize/rabitq_test.go line 274 at r1 (raw file):

func loadFeatures(t testing.TB) vector.Set {
	f, err := os.Open("../features_10000.gob")

Since features is an opaque binary, maybe include some information about what this is?

andy-kimball force-pushed the rabitq branch from 21d0b79 to b225395 Compare

November 6, 2024 21:59

andy-kimball commented

View reviewed changes

Contributor Author

andy-kimball left a comment

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mw5h)

pkg/sql/vecindex/quantize/rabitq.go line 25 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Add this URL to the commit message. Maybe consider committing the paper? We don't know what the future availability of it will be.

Done

pkg/sql/vecindex/quantize/rabitq.go line 36 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

I find that comments describing non-obvious struct fields to be very high value. I normally don't like code with a lot of inline comments, but I'm never sad about well documented structs.

Done.

pkg/sql/vecindex/quantize/rabitq.go line 384 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Why is this not part of the for loop decl?

Just to avoid adding 8 twice. I went ahead and changed it, though.

pkg/sql/vecindex/quantize/rabitqpb.go line 67 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Maybe I'm overthinking things, but these Grow()/append() operations seem like they could be expensive. Did you consider a either a list or a dictionary of fixed-size pages?

I think slices are the simplest and clearest. However, if perf testing shows that some other data structure out-performs slices, we could switch to them. I didn't see any evidence of this being a perf bottleneck on profiles, though.

pkg/sql/vecindex/quantize/rabitq_test.go line 83 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Is there any value in testing large vectors (2k-ish elements)?

I do test 512 byte vectors below. Given how the algorithm works, I don't think there's much additional value to testing even more than that in unit tests. As we do more perf testing, though, we will want to test higher-dimensionality vectors.

pkg/sql/vecindex/quantize/rabitq_test.go line 274 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Since features is an opaque binary, maybe include some information about what this is?

I added a descriptive comment.

andy-kimball force-pushed the rabitq branch from b225395 to 6b7947b Compare

November 7, 2024 00:23

blathers-crl bot commented Nov 7, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}


          quantize: add rabitq quantization

1ec9302

This commit implements RaBitQ quantization according to the algorithm described
in this paper:

    "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound
    for Approximate Nearest Neighbor Search" by Jianyang Gao & Cheng Long.
    URL: https://arxiv.org/pdf/2405.12497

The RaBitQ quantization method provides good accuracy, produces compact
codes, provides practical error bounds, is easy to implement, and can be
accelerated with fast SIMD instructions. RaBitQ quantization codes use only
1 bit per dimension in the original vector.

Epic: CRDB-42943

Release note: None

andy-kimball force-pushed the rabitq branch from 6b7947b to 1ec9302 Compare

November 7, 2024 02:50

mw5h approved these changes

View reviewed changes

Contributor

mw5h left a comment

Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball)

mw5h approved these changes

View reviewed changes

Contributor

mw5h left a comment

Thanks!

Reviewed 1 of 8 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)

Contributor Author

andy-kimball commented Nov 7, 2024

bors r=mw5h

Contributor

craig bot commented Nov 7, 2024

Build succeeded:

craig bot merged commit f77bcd0 into cockroachdb:master

23 checks passed

andy-kimball deleted the rabitq branch

November 7, 2024 22:01

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet