-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vecindex: redistribute vectors across level during split #135506
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 6 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @mw5h)
pkg/sql/vecindex/vector_index_test.go
line 256 at r1 (raw file):
// Insert vectors into the store. var wait sync.WaitGroup for i := 0; i < vectors.Count; i += s.Options.MaxPartitionSize {
Could we do the inserts in randomly-sized chunks?
pkg/sql/vecindex/vector_index.go
line 624 at r1 (raw file):
defer searchCtx.Workspace.FreeVector(queryVector) vi.quantizer.RandomizeVector( searchCtx.Ctx, queryVector, searchCtx.Randomized, true /* invert */)
Consider changing RandomizeVector
to have inputVector
and outputVector
parameters instead, to avoid this confusion.
pkg/sql/vecindex/fixup_processor.go
line 535 at r1 (raw file):
for i := 0; i < split.Vectors.Count; i++ { if split.Vectors.Count == 1 { // Don't allow so many vectors to be moved that the partition ends
Note: the other PR removes dangling vectors in getFullVectorsForPartition
, but in this PR alone, we could end up with just a dangling vector remaining. Is that fine?
pkg/sql/vecindex/fixup_processor.go
line 537 at r1 (raw file):
// Don't allow so many vectors to be moved that the partition ends // up empty. break
Note: we could potentially move all vectors and just not insert the new partition. That's probably very rare, though.
pkg/sql/vecindex/fixup_processor.go
line 550 at r1 (raw file):
// Get the full vectors for the parent partition's children, if they have // not already been fetched. if parentVectors.Dims == 0 {
[nit] Consider passing a closure that handles this.
pkg/sql/vecindex/fixup_processor.go
line 635 at r1 (raw file):
// the split partition's centroid. if result.QuerySquaredDistance >= result.CentroidDistance*result.CentroidDistance { continue
Can we exploit the ordering on PopResults()
and stop early?
pkg/sql/vecindex/fixup_processor.go
line 730 at r1 (raw file):
tempKeys: fp.searchCtx.tempKeys, tempCounts: fp.searchCtx.tempCounts, tempVectorsWithKeys: fp.searchCtx.tempVectorsWithKeys,
It's worth noting that we'll hold onto the memory of the vector.T
slices in tempVectorsWithKeys
unless we reset each element in the slice. That could get pretty large, right?
pkg/sql/vecindex/testdata/split.ddt
line 136 at r1 (raw file):
└───• vec1 (-2, -2) # Add vectors to partition 2 until it splits and then pulls in vec3 from
Are you sure those were inserted into partition 2? Wouldn't it have been removed from the tree if it split?
5279504
to
c8b9777
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball and @mw5h)
pkg/sql/vecindex/fixup_processor.go
line 535 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Note: the other PR removes dangling vectors in
getFullVectorsForPartition
, but in this PR alone, we could end up with just a dangling vector remaining. Is that fine?
The crucial rule is that we don't want empty non-leaf partitions. Empty leaf partitions are fine, and that's the only place where we'd have a dangling vector. I went ahead and added that check here and beefed up the comment (and added a test that triggers this code path ), just so that it's more obvious why this check exists.
pkg/sql/vecindex/fixup_processor.go
line 537 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Note: we could potentially move all vectors and just not insert the new partition. That's probably very rare, though.
Yeah, it'd be vanishingly rare, so not worth complicating the code.
pkg/sql/vecindex/fixup_processor.go
line 550 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit] Consider passing a closure that handles this.
I made the change. It's a bit more complex-looking, but maybe easier to understand?
pkg/sql/vecindex/fixup_processor.go
line 635 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Can we exploit the ordering on
PopResults()
and stop early?
No, just because a result is farther away from the split partition's centroid doesn't mean it's closer than that to its own centroid. However, I realize I can use PopUnsortedResults
here - there's no need to be sorting.
pkg/sql/vecindex/fixup_processor.go
line 730 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
It's worth noting that we'll hold onto the memory of the
vector.T
slices intempVectorsWithKeys
unless we reset each element in the slice. That could get pretty large, right?
Its size should be proportional to the max number of vectors in a partition, which is 128. Sure, sometimes partitions can get temporarily bigger than that, but it's still not very large.
pkg/sql/vecindex/vector_index.go
line 624 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Consider changing
RandomizeVector
to haveinputVector
andoutputVector
parameters instead, to avoid this confusion.
Done
pkg/sql/vecindex/testdata/split.ddt
line 136 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Are you sure those were inserted into partition 2? Wouldn't it have been removed from the tree if it split?
The test was previously working, but I made other changes that caused it to behave differently, but didn't realize those changes had invalidated this test. Good catch, fixed.
pkg/sql/vecindex/vector_index_test.go
line 256 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Could we do the inserts in randomly-sized chunks?
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball and @mw5h)
pkg/sql/vecindex/fixup_processor
line 0 at r2 (raw file):
What's up with this?
pkg/sql/vecindex/quantize/unquantizer.go
line 44 at r2 (raw file):
// RandomizeVector implements the Quantizer interface. func (q *unQuantizer) RandomizeVector( ctx context.Context, input vector.T, output vector.T, invert bool,
Huh, was this previously a bug since it didn't check invert
and reverse the direction of the copy?
c8b9777
to
b7143bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @mw5h)
pkg/sql/vecindex/fixup_processor
line at r2 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
What's up with this?
Hm, not sure what happened. Fixed.
pkg/sql/vecindex/quantize/unquantizer.go
line 44 at r2 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Huh, was this previously a bug since it didn't check
invert
and reverse the direction of the copy?
Yes, it was a bug. I forgot to add a test for this; added now.
When a partition is split, move vectors to closer partitions: * For each vector in the splitting partition, move it to a "sibling" partition if its centroid is closer. * For each nearby vector at the same level as the splitting partition, move it to the splitting partition if its centroid is closer. These optimizations significantly increase the recall rate of searches. Prevent non-leaf-level partitions from becoming empty due to moving vectors between partitions, since that would cause the K-means tree to become unbalanced. Epic: CRDB-42943 Release note: None
b7143bc
to
b56c780
Compare
bors r=drewkimball |
Build succeeded: |
When a partition is split, move vectors to closer partitions:
These optimizations significantly increase the recall rate of searches.
Prevent non-leaf-level partitions from becoming empty due to moving vectors between partitions, since that would cause the K-means tree to become unbalanced.
Epic: CRDB-42943
Release note: None