Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quantize: add rabitq quantization #134370

Merged
merged 1 commit into from
Nov 7, 2024
Merged

Conversation

andy-kimball
Copy link
Contributor

This commit implements RaBitQ quantization according to the algorithm described in this paper:

"RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound
for Approximate Nearest Neighbor Search" by Jianyang Gao & Cheng Long.

The RaBitQ quantization method provides good accuracy, produces compact codes, provides practical error bounds, is easy to implement, and can be accelerated with fast SIMD instructions. RaBitQ quantization codes use only 1 bit per dimension in the original vector.

Epic: CRDB-42943

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just a few nits.

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @DrewKimball)


pkg/sql/vecindex/quantize/rabitq.go line 25 at r1 (raw file):

//	"RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound
//	for Approximate Nearest Neighbor Search" by Jianyang Gao & Cheng Long.
//	URL: https://arxiv.org/pdf/2405.12497

Add this URL to the commit message. Maybe consider committing the paper? We don't know what the future availability of it will be.


pkg/sql/vecindex/quantize/rabitq.go line 36 at r1 (raw file):

// vector index. This is important, because the ROT matrix is expensive to
// generate and can use quite a bit of memory.
type raBitQuantizer struct {

I find that comments describing non-obvious struct fields to be very high value. I normally don't like code with a lot of inline comments, but I'm never sad about well documented structs.


pkg/sql/vecindex/quantize/rabitq.go line 384 at r1 (raw file):

			codeBits |= getSignBit(elements[7])

			dim += 8

Why is this not part of the for loop decl?


pkg/sql/vecindex/quantize/rabitqpb.go line 67 at r1 (raw file):

// set to defined values before use.
func (cs *RaBitQCodeSet) AddUndefined(count int) {
	cs.Data = slices.Grow(cs.Data, count*cs.Width)

Maybe I'm overthinking things, but these Grow()/append() operations seem like they could be expensive. Did you consider a either a list or a dictionary of fixed-size pages?


pkg/sql/vecindex/quantize/rabitq_test.go line 83 at r1 (raw file):

// Edge cases.
func TestRaBitQuantizerEdge(t *testing.T) {

Is there any value in testing large vectors (2k-ish elements)?


pkg/sql/vecindex/quantize/rabitq_test.go line 274 at r1 (raw file):

func loadFeatures(t testing.TB) vector.Set {
	f, err := os.Open("../features_10000.gob")

Since features is an opaque binary, maybe include some information about what this is?

Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mw5h)


pkg/sql/vecindex/quantize/rabitq.go line 25 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Add this URL to the commit message. Maybe consider committing the paper? We don't know what the future availability of it will be.

Done


pkg/sql/vecindex/quantize/rabitq.go line 36 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

I find that comments describing non-obvious struct fields to be very high value. I normally don't like code with a lot of inline comments, but I'm never sad about well documented structs.

Done.


pkg/sql/vecindex/quantize/rabitq.go line 384 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Why is this not part of the for loop decl?

Just to avoid adding 8 twice. I went ahead and changed it, though.


pkg/sql/vecindex/quantize/rabitqpb.go line 67 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Maybe I'm overthinking things, but these Grow()/append() operations seem like they could be expensive. Did you consider a either a list or a dictionary of fixed-size pages?

I think slices are the simplest and clearest. However, if perf testing shows that some other data structure out-performs slices, we could switch to them. I didn't see any evidence of this being a perf bottleneck on profiles, though.


pkg/sql/vecindex/quantize/rabitq_test.go line 83 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Is there any value in testing large vectors (2k-ish elements)?

I do test 512 byte vectors below. Given how the algorithm works, I don't think there's much additional value to testing even more than that in unit tests. As we do more perf testing, though, we will want to test higher-dimensionality vectors.


pkg/sql/vecindex/quantize/rabitq_test.go line 274 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

Since features is an opaque binary, maybe include some information about what this is?

I added a descriptive comment.

Copy link

blathers-crl bot commented Nov 7, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

This commit implements RaBitQ quantization according to the algorithm described
in this paper:

    "RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound
    for Approximate Nearest Neighbor Search" by Jianyang Gao & Cheng Long.
    URL: https://arxiv.org/pdf/2405.12497

The RaBitQ quantization method provides good accuracy, produces compact
codes, provides practical error bounds, is easy to implement, and can be
accelerated with fast SIMD instructions. RaBitQ quantization codes use only
1 bit per dimension in the original vector.

Epic: CRDB-42943

Release note: None
Copy link
Contributor

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball)

Copy link
Contributor

@mw5h mw5h left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thanks!

Reviewed 1 of 8 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball)

@andy-kimball
Copy link
Contributor Author

bors r=mw5h

@craig craig bot merged commit f77bcd0 into cockroachdb:master Nov 7, 2024
23 checks passed
@andy-kimball andy-kimball deleted the rabitq branch November 7, 2024 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants