-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: deflake acceptance/version-upgrade #52750
kvserver: deflake acceptance/version-upgrade #52750
Conversation
I couldn't actually repro the failure observed, but I'm pretty sure this is what's needed. I can try stressing again tomorrow with and without this patch to confirm. I tried mixed version clusters and generating splits manually but still was unable to. I probably wasn't creating abort span bytes in doing so. Was trying:
|
@@ -1036,6 +1036,9 @@ func splitTriggerHelper( | |||
if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionContainsEstimatesCounter) { | |||
deltaPostSplitLeft.ContainsEstimates = 0 | |||
} | |||
if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionAbortSpanBytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? If my reading of the callers of this code is correct, deltaPostSplitLeft
will essentially be ms
, meaning that the existing zeroing out should override it just fine. Don't you want to massage the RHSDelta
above? That would make more sense to me as it is used to seed the stats for the RHS, and will not go through the existing check.
Looking at the snippet @chrisseto posted:
I'm inclined to say this is the patch we're looking for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, tricky! Good catch. I think you got the wrong side of the split, but this does seem like a problem. I hope you can reproduce before and after as this is all tricky enough that there could be yet another problem.
Hm, I think we need both. Something like: @@ -1036,6 +1033,10 @@ func splitTriggerHelper(
var pd result.Result
pd.Replicated.Split = &kvserverpb.Split{
SplitTrigger: *split,
// NB: the RHSDelta is identical to the stats for the newly created right
// hand side range (i.e. it goes from zero to its stats).
RHSDelta: *h.AbsPostSplitRight(),
}
deltaPostSplitLeft := h.DeltaPostSplitLeft()
if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionContainsEstimatesCounter) {
deltaPostSplitLeft.ContainsEstimates = 0
}
if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionContainsEstimatesCounter) {
deltaPostSplitLeft.ContainsEstimates = 0
}
+ if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionAbortSpanBytes) {
+ deltaPostSplitLeft.AbortSpanBytes = 0
+ pd.Replicated.Split.RHSDelta.AbortSpanBytes = 0
+ }
return deltaPostSplitLeft, pd, nil
} But yes, I'll have to repro instead of wildly speculating. |
a6021c0
to
858a57b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg Added a (contrived) test that confirms the patch. Happy to not check-in the test into our repo, though I did try to generalize it a bit.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @madelineliao and @tbg)
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1039 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Are you sure? If my reading of the callers of this code is correct,
deltaPostSplitLeft
will essentially bems
, meaning that the existing zeroing out should override it just fine. Don't you want to massage theRHSDelta
above? That would make more sense to me as it is used to seed the stats for the RHS, and will not go through the existing check.
Done.
858a57b
to
95256b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I think we need both.
Interesting, but why? deltaPostSplitLeft
is newMS
in this code here:
cockroach/pkg/kv/kvserver/batcheval/cmd_end_transaction.go
Lines 596 to 599 in 06cd1cb
newMS, trigger, err := splitTrigger( | |
ctx, rec, batch, *ms, ct.SplitTrigger, txn.WriteTimestamp, | |
) | |
*ms = newMS |
so we use it as the "main" stats (which makes sense), so its AbortSpanBytes should be zeroed out by the existing check before proposal. Hmm, your code only zeroes the RHS, so you've come around then?
For the test, I wonder if we can "amend" the existing main version-upgrade test instead of adding a new one? I think it's a good idea to have some more generic splitting going on in that one.
Thanks for the fix!
Reviewed 7 of 7 files at r2, 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @madelineliao)
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1034 at r3 (raw file):
deltaPostSplitLeft := h.DeltaPostSplitLeft() if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionContainsEstimatesCounter) { deltaPostSplitLeft.ContainsEstimates = 0
Heh, by my same reasoning, this is not necessary but it's necessary on the right. I don't think splits ever come out with ContainsEstimates set, on the RHS, though. Too late now anyway. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, but why?
Yea I was mistaken, I walked through the codepaths after and realized what you were saying. I'll try seeing what the amended version of version-upgrade looks like and abandon this other test, but I might also defer to later given it's holding up the release. We could probably run workload kv
against it in the same way I did here. TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @madelineliao)
Fixes cockroachdb#52627. Unskips it as well. In cockroachdb#50938 we were careful to not transmit non-zero MVCCStats through the replica proposal codepaths, but there seems to have been another instance where we end up transmitting MVCCStats across replicas - during splits. We similarly zero AbortSpanBytes out when VersionAbortSpanBytes is inactive to avoid replica divergence. We repro-ed the failure observed in cockroachdb#52627 in this `splits/mixed-version` roachtest that we introduced in the PR, but are choosing not to check in. We'll follow up in a future PR by improving version-upgrade to add splits/workload running against it. Release note: None
95256b3
to
9522fdf
Compare
bors r+ |
🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set. |
bors r+ |
Build succeeded: |
Fixes cockroachdb#52907. Unskips it as well. The patch in cockroachdb#52750 was not quite complete. We had zeroed out the replicated RHS delta appropriately, but forgot to consider the RHS stats added to the batch then later used to seed RHS state. I re-verified this patch using the same `splits/mixed-version` test but running for more iterations. Release note: None
53011: batcheval: deflake version-upgrade (again) r=irfansharif a=irfansharif Fixes #52907. Unskips it as well. The patch in #52750 was not quite complete. We had zeroed out the replicated RHS delta appropriately, but forgot to consider the RHS stats added to the batch then later used to seed RHS state. I re-verified this patch using the same `splits/mixed-version` test but running for more iterations. Release note: None Co-authored-by: irfan sharif <[email protected]>
Fixes #52627.
In #50938 we were careful to not transmit non-zero MVCCStats through the
replica proposal codepaths, but there seems to have been another
instance where we end up transmitting MVCCStats across replicas - during
splits. We similarly zero AbortSpanBytes out when VersionAbortSpanBytes
is inactive to avoid replica divergence.
Release note: None