-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: reconcile manual splitting with automatic merging #37506
Conversation
96a81d8
to
a2d5b8e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job getting this out so quickly! I left a large number of comments, but don't take that as a bad sign -- this is definitely on the right track.
Reviewed 7 of 7 files at r1, 5 of 5 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jeffrey-xiao)
pkg/internal/client/batch.go, line 606 at r1 (raw file):
}, SplitKey: splitKey, Manual: true,
Leave a comment about why we're setting this.
pkg/kv/split_test.go, line 295 at r1 (raw file):
} func TestRangeSplitsAlreadySplitStickyBit(t *testing.T) {
Give this a comment.
pkg/roachpb/api.proto, line 639 at r1 (raw file):
RequestHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true]; bytes split_key = 2 [(gogoproto.casttype) = "Key"]; bool manual = 3;
Give this a comment.
pkg/sql/split_test.go, line 61 at r2 (raw file):
r.Exec(t, "ALTER TABLE d.t SPLIT AT VALUES (1, 'a')") // Prevent the merge queue from immediately discarding our splits.
This comment is echoed all over the place. I think we can remove anything that says this.
pkg/storage/client_merge_test.go, line 3318 at r2 (raw file):
rhsEndKey := roachpb.RKey("c") for _, k := range []roachpb.RKey{lhsStartKey, rhsStartKey, rhsEndKey} { split(t, k.AsRawKey(), false)
Our style guide discusses adding inline comments for constant arguments to functions. So this would become:
split(t, k.AsRawKey(), false /* manual */)
Let's make this change here and below.
pkg/storage/client_merge_test.go, line 3419 at r2 (raw file):
}) t.Run("sticky bit", func(t *testing.T) {
Could you add a TODO right above this to add a subtest for considering load when making merging decisions? It would be nice to test this logic:
cockroach/pkg/storage/merge_queue.go
Lines 258 to 269 in de3ac93
// Check if the merged range would need to be split, if so, skip merge. | |
// Use a lower threshold for load based splitting so we don't find ourselves | |
// in a situation where we keep merging ranges that would be split soon after | |
// by a small increase in load. | |
loadBasedSplitPossible := lhsRepl.SplitByLoadQPSThreshold() < 2*mergedQPS | |
if ok, _ := shouldSplitRange(mergedDesc, mergedStats, lhsRepl.GetMaxBytes(), sysCfg); ok || loadBasedSplitPossible { | |
log.VEventf(ctx, 2, | |
"skipping merge to avoid thrashing: merged range %s may split "+ | |
"(estimated size, estimated QPS: %d, %v)", | |
mergedDesc, mergedStats.Total(), mergedQPS) | |
return nil | |
} |
pkg/storage/client_merge_test.go, line 3424 at r2 (raw file):
verifyUnmerged(t) // Perform manual merge and verify that no merge occurred
Idiomatic Go uses sentences (with proper punctuation) for all comments: https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences.
pkg/storage/client_merge_test.go, line 3430 at r2 (raw file):
verifyUnmerged(t) // Delete sticky bit and verify that merge occurs
Add a TODO to hook this up to the same mechanism that ALTER TABLE ... UNSPLIT AT
uses when we eventually add that.
pkg/storage/merge_queue.go, line 213 at r2 (raw file):
} func (mq *mergeQueue) isManualSplit(ctx context.Context, splitKey roachpb.RKey) (bool, error) {
Give this a comment.
pkg/storage/merge_queue.go, line 313 at r2 (raw file):
} if manualSplit { log.VEventf(ctx, 2, "skipping merge: ranges were manually split")
Please add a TODO to consider returning a "purgatory" error in this case to avoid repeatedly processing ranges that can't be merged.
pkg/storage/replica_command.go, line 226 at r1 (raw file):
} log.Event(ctx, "range already split") // Even if the range is already split, we should still set the sticky bit
This is missing a if args.Manual {
condition.
EDIT: it looks like you added this in the next commit. Let's rebase this fix back to the first one. Also, let's make sure that some test would have failed without this and without the corresponding condition in the main split transaction.
pkg/storage/replica_command.go, line 228 at r1 (raw file):
// Even if the range is already split, we should still set the sticky bit err := r.store.DB().Txn(ctx, func(ctx context.Context, txn *client.Txn) error { b := txn.NewBatch()
This transaction can currently race with a merge to leak sticky bits. In order to serialize with any merge transactions, I think we'll want to first read the range descriptor and make sure that it's equal to the provided desc
. If not, we'll want to retry (see how Replica.AdminSplit
handles ConditionFailedError
s).
For reference, the split transaction below prevents these kinds of races by performing conditional puts on the range descriptor. We're not changing the range descriptor here, so there's no need to perform a CPut, but we'll want similar retry behavior.
pkg/storage/replica_command.go, line 319 at r1 (raw file):
// Add sticky bit if args.Manual { b.Put(keys.SplitStickyBitKey(rightDesc.StartKey), true)
Where did we land on whether this should be a timestamp or a boolean?
pkg/storage/replica_command.go, line 549 at r1 (raw file):
func waitForApplication( ctx context.Context,
Let's get rid of this code movement. It changed in commit 1 and was later reverted. That could trip up a future git blame
, so it's worth fixing in the first commit.
1b857df
to
a0d4028
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)
pkg/storage/replica_command.go, line 319 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Where did we land on whether this should be a timestamp or a boolean?
I think it depends if we want to support temporarily not merging manual splits. We would also need to consider how this time frame can be expressed on the SQL side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)
pkg/internal/client/batch.go, line 606 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Leave a comment about why we're setting this.
Done
pkg/kv/split_test.go, line 295 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this a comment.
Done
pkg/roachpb/api.proto, line 639 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this a comment.
Done
pkg/storage/client_merge_test.go, line 3318 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Our style guide discusses adding inline comments for constant arguments to functions. So this would become:
split(t, k.AsRawKey(), false /* manual */)
Let's make this change here and below.
Done
pkg/storage/client_merge_test.go, line 3419 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a TODO right above this to add a subtest for considering load when making merging decisions? It would be nice to test this logic:
cockroach/pkg/storage/merge_queue.go
Lines 258 to 269 in de3ac93
// Check if the merged range would need to be split, if so, skip merge. // Use a lower threshold for load based splitting so we don't find ourselves // in a situation where we keep merging ranges that would be split soon after // by a small increase in load. loadBasedSplitPossible := lhsRepl.SplitByLoadQPSThreshold() < 2*mergedQPS if ok, _ := shouldSplitRange(mergedDesc, mergedStats, lhsRepl.GetMaxBytes(), sysCfg); ok || loadBasedSplitPossible { log.VEventf(ctx, 2, "skipping merge to avoid thrashing: merged range %s may split "+ "(estimated size, estimated QPS: %d, %v)", mergedDesc, mergedStats.Total(), mergedQPS) return nil }
Done
pkg/storage/client_merge_test.go, line 3424 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Idiomatic Go uses sentences (with proper punctuation) for all comments: https://github.com/golang/go/wiki/CodeReviewComments#comment-sentences.
Done
pkg/storage/client_merge_test.go, line 3430 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Add a TODO to hook this up to the same mechanism that
ALTER TABLE ... UNSPLIT AT
uses when we eventually add that.
I'm working on adding ALTER TABLE ... UNSPLIT AT
in a followup commit. Should that be in a separate PR?
pkg/storage/merge_queue.go, line 213 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Give this a comment.
Done
pkg/storage/merge_queue.go, line 313 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Please add a TODO to consider returning a "purgatory" error in this case to avoid repeatedly processing ranges that can't be merged.
Done
pkg/storage/replica_command.go, line 226 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is missing a
if args.Manual {
condition.EDIT: it looks like you added this in the next commit. Let's rebase this fix back to the first one. Also, let's make sure that some test would have failed without this and without the corresponding condition in the main split transaction.
Done
pkg/storage/replica_command.go, line 228 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This transaction can currently race with a merge to leak sticky bits. In order to serialize with any merge transactions, I think we'll want to first read the range descriptor and make sure that it's equal to the provided
desc
. If not, we'll want to retry (see howReplica.AdminSplit
handlesConditionFailedError
s).For reference, the split transaction below prevents these kinds of races by performing conditional puts on the range descriptor. We're not changing the range descriptor here, so there's no need to perform a CPut, but we'll want similar retry behavior.
Done
pkg/storage/replica_command.go, line 549 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Let's get rid of this code movement. It changed in commit 1 and was later reverted. That could trip up a future
git blame
, so it's worth fixing in the first commit.
Done
477fcc0
to
bb64978
Compare
02b967c
to
ad1f871
Compare
1ddff0d
to
b978ff9
Compare
5f9dfe8
to
b720dd3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! Almost there. The only remaining question in my mind is whether we should represent the sticky bit as a boolean or as a timestamp. Let's discuss that inline below.
Reviewed 86 of 86 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jeffrey-xiao)
pkg/internal/client/db.go, line 469 at r3 (raw file):
// different range, hence the separation between spanKey and splitKey. See // #16008 for details, and #16344 for the tracking issue to clean this mess up // properly.
Add a paragraph about what the manual
flag means.
pkg/roachpb/api.proto, line 639 at r1 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
Done
Right now we're just documenting how this is used, and not what the field actually does or why someone would want to set it.
pkg/roachpb/data.proto, line 198 at r3 (raw file):
// StickyBitTrigger indicates that the sticky bit of a range has been changed. // This trigger is used in two cases: // 1. Unsplitting a range.
Define what we mean by "unsplitting" here. We don't want it to be confused with merging.
pkg/roachpb/metadata.proto, line 174 at r3 (raw file):
// replica applying the merge is. optional int64 generation = 6; optional bool sticky_bit = 7;
We can add [(gogoproto.nullable) = false];
to this line to avoid the *bool
in the generated code.
pkg/roachpb/metadata.proto, line 174 at r3 (raw file):
// replica applying the merge is. optional int64 generation = 6; optional bool sticky_bit = 7;
Give this a comment. This will serve as the primary source of documentation on how the sticky bit works, what it's used for, and why it exists at all.
pkg/sql/split.go, line 109 at r3 (raw file):
// because of gossip inconsistency, or it could change halfway through the // SPLIT AT's execution. It is, however, likely to prevent user error and // confusion in the common case.
Add a TODO to remove this in version 20.1.
pkg/sql/split.go, line 131 at r3 (raw file):
} if err := params.extendedEvalCtx.ExecCfg.DB.AdminSplit(params.ctx, rowKey, rowKey, true /* manual */); err != nil {
Discussed in person. Let's set to the value of params.EvalContext().Settings.Version.IsActive(cluster.VersionStickyBit)
.
pkg/storage/client_merge_test.go, line 3430 at r2 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
I'm working on adding
ALTER TABLE ... UNSPLIT AT
in a followup commit. Should that be in a separate PR?
Yes, this should be in a follow-up PR.
pkg/storage/replica_command.go, line 319 at r1 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
I think it depends if we want to support temporarily not merging manual splits. We would also need to consider how this time frame can be expressed on the SQL side.
Intuitively I would lean towards making this a timestamp so that we could add an expiration policy to manual splits in the future for without a tricky migration, but doing so will come with a cost. A bool will encode to a single byte while an HLC timestamp will encode to ~8.
@ajwerner I'm curious if you have an opinion on this.
pkg/storage/replica_command.go, line 228 at r3 (raw file):
log.Event(ctx, "range already split") // Even if the range is already split, we should still set the sticky bit if args.Manual {
if args.Manual && !desc.StickyBit {
, right?
pkg/storage/replica_command.go, line 233 at r3 (raw file):
err := r.store.DB().Txn(ctx, func(ctx context.Context, txn *client.Txn) error { b := txn.NewBatch() if err := updateRangeDescriptor(b, keys.RangeDescriptorKey(desc.StartKey), desc, &newDesc); err != nil {
nit: pull descKey := keys.RangeDescriptorKey(desc.StartKey)
out into the previous line.
b720dd3
to
3e7c045
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/roachpb/metadata.proto, line 174 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We can add
[(gogoproto.nullable) = false];
to this line to avoid the*bool
in the generated code.
I don't think I can set [(gogoproto.nullable) = false];
for native fields in proto3. I'm getting an error when attempting to set it, but please correct me if I'm wrong.
pkg/storage/client_merge_test.go, line 3430 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, this should be in a follow-up PR.
Sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)
pkg/roachpb/metadata.proto, line 174 at r3 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
I don't think I can set
[(gogoproto.nullable) = false];
for native fields in proto3. I'm getting an error when attempting to set it, but please correct me if I'm wrong.
What error are you getting? This file isn't proto3, it's proto2.
pkg/roachpb/metadata.proto, line 174 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Oh this file is proto2, but I also wanted to keep it consistent with StickyBitTrigger which is a proto3 file. The error I'm getting is:
Another place I'm using a boolean is the Manual flag in AdminSplitRequest. Should I just add the nullable attribute to only the sticky_bit in RangeDescriptor? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffrey-xiao and @nvanbenschoten)
pkg/storage/replica_command.go, line 319 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Intuitively I would lean towards making this a timestamp so that we could add an expiration policy to manual splits in the future for without a tricky migration, but doing so will come with a cost. A bool will encode to a single byte while an HLC timestamp will encode to ~8.
@ajwerner I'm curious if you have an opinion on this.
I'm on board with the HLC. Given that start and end keys live in range descriptors and they can be quite large it's probably fine. It will probably also help from a UX perspective. I feel like if I come to a cluster and find it has manual splits I'd be curious about when they were added.
3e7c045
to
1c35801
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 18 of 18 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jeffrey-xiao)
pkg/roachpb/data.proto, line 207 at r4 (raw file):
option (gogoproto.equal) = true; util.hlc.Timestamp sticky_bit = 1;
Add a comment along the lines of // Set to nil to remove a RangeDescriptor's sticky bit.
pkg/storage/replica_command.go, line 228 at r3 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Since the sticky bit is changed to a HLC, we shouldn't check for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jeffrey-xiao)
pkg/storage/replica_command.go, line 228 at r3 (raw file):
Previously, jeffrey-xiao (Jeffrey Xiao) wrote…
Since the sticky bit is changed to a HLC, we should check for
desc.StickyBit == nil
anymore because we would want to update the HLC right?
That's an interesting point. So you're saying that we should "refresh" the sticky bit to a higher timestamp? Yes, sounds like the right behavior to me.
1c35801
to
1e697b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really nice work! Update the comments on the SickyBitTrigger and merge it!
Reviewed 1 of 5 files at r2, 66 of 86 files at r3, 15 of 18 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @jeffrey-xiao, and @nvanbenschoten)
pkg/roachpb/data.proto, line 196 at r4 (raw file):
} // StickyBitTrigger indicates that the sticky bit of a range has been changed.
nit: s/has been/should be/?
Is there any validation that this actually is the timestamp that's in the range descriptor when this commits? It's probably worth a comment that this is always set to the value used in a CPut for the transaction and that it's the client's responsibility to ensure that the timestamps align.
pkg/roachpb/data.proto, line 207 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Add a comment along the lines of
// Set to nil to remove a RangeDescriptor's sticky bit.
+1
The current design of automatic range merges must be aware of which ranges were automatically or manually split since manual splits should not be automatically merged. The sticky bit is a HLC added to the range descriptor that indicates the range should be not merged into the merge to its left. The sticky bit is set on the RHS of a split operation. If you attempt to split on the start key of a range, then the sticky bit of the range is set, but no split occurs. Release note: None
When ranges are being processed in the merge queue, the sticky bit is checked on the RHS of the merge operation. If the sticky bit is set, then the merge does not go through. Release note: None
Remove all logic that disables merge queue from running in order to preserve splits in tests since the splits should be preserved if they were marked as manual splits. The error outputted when attemping to split while the merge queue is enabled is now gated behind VersionStickyBit. Release note: None
1e697b1
to
d81b1ef
Compare
bors r+ |
37506: storage: reconcile manual splitting with automatic merging r=jeffrey-xiao a=jeffrey-xiao Follows the steps outlined in #37487 to reconcile manual splitting with automatic merging. This PR includes the changes needed at the storage layer. The sticky bit indicating that a range is manually split is added to the range descriptor. 37558: docs/tla-plus: add timestamp refreshes to ParallelCommits spec r=nvanbenschoten a=nvanbenschoten This commit adds transaction timestamp refreshes to the PlusCal model for parallel commits. This allows the committing transaction to bump its timestamp without a required epoch bump. This completes the parallel commits formal specification. 37578: opt, sql: fix type inference of TypeCheck for subqueries r=rytaft a=rytaft Prior to this commit, the optimizer was not correctly inferring the types of columns in subqueries for expressions of the form `scalar IN (subquery)`. This was due to two problems which have now been fixed: 1. The subquery was built as a relational expression before the desired types were known. Now the subquery build is delayed until `TypeCheck` is called for the first time. 2. For subqueries on the right side of an `IN` expression, the desired type passed into `TypeCheck` was `AnyTuple`. This resulted in an error later on in `typeCheckSubqueryWithIn`, which checks to make sure the type of the subquery is `tuple{T}` where `T` is the type of the left expression. Now the desired type passed into `TypeCheck` is `tuple{T}`. Note that this commit only fixes type inference for the optimizer. It is still broken in the heuristic planner. Fixes #37263 Fixes #14554 Release note (bug fix): Fixed type inference of columns in subqueries for some expressions of the form `scalar IN (subquery)`. Co-authored-by: Jeffrey Xiao <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Rebecca Taft <[email protected]>
Build succeeded |
This allows us to remove a lot of complicated logic introduced in cockroachdb#37506 and cockroachdb#28961 around conditionally setting split expirations and disallowing manual splits when the merge queue is enabled.
Follows the steps outlined in #37487 to reconcile manual splitting with automatic merging. This PR includes the changes needed at the storage layer. The sticky bit indicating that a range is manually split is added to the range descriptor.