kvserver: deflake acceptance/version-upgrade #52750

Merged

irfansharif
Contributor

Fixes #52627.

In #50938 we were careful to not transmit non-zero MVCCStats through the
replica proposal codepaths, but there seems to have been another
instance where we end up transmitting MVCCStats across replicas - during
splits. We similarly zero AbortSpanBytes out when VersionAbortSpanBytes
is inactive to avoid replica divergence.

Release note: None
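
For illustration, the version-gating idea in the description can be sketched as below. This is a minimal sketch under stated assumptions, not the actual kvserver code: `Stats`, `versionGate`, and `sanitize` are hypothetical stand-ins for `MVCCStats`, the cluster version check, and the zeroing logic.

```go
package main

import "fmt"

// Stats is a hypothetical, pared-down stand-in for MVCCStats.
type Stats struct {
	LiveBytes         int64
	ContainsEstimates int64
	AbortSpanBytes    int64
}

// versionGate is a hypothetical stand-in for the cluster version check.
// A feature maps to true once every node in the cluster understands it.
type versionGate map[string]bool

// IsActive reports whether the named feature is active cluster-wide.
func (g versionGate) IsActive(feature string) bool { return g[feature] }

// sanitize returns a copy of s with every field zeroed that some node in
// the cluster may not understand yet, so all replicas apply the same stats.
func sanitize(g versionGate, s Stats) Stats {
	if !g.IsActive("ContainsEstimatesCounter") {
		s.ContainsEstimates = 0
	}
	if !g.IsActive("AbortSpanBytes") {
		s.AbortSpanBytes = 0
	}
	return s
}

func main() {
	s := Stats{LiveBytes: 100, ContainsEstimates: 1, AbortSpanBytes: 42}

	// Mixed-version cluster: AbortSpanBytes is not yet active.
	mixed := versionGate{"ContainsEstimatesCounter": true}
	fmt.Printf("%+v\n", sanitize(mixed, s))

	// Fully upgraded cluster: all fields pass through unchanged.
	upgraded := versionGate{"ContainsEstimatesCounter": true, "AbortSpanBytes": true}
	fmt.Printf("%+v\n", sanitize(upgraded, s))
}
```

The point of the sketch is that the gate has to sit on every path that ships stats to other replicas; a path without it can hand a non-zero field to a node that does not track that field.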

@cockroach-teamcity
Member

This change is Reviewable

@irfansharif
Contributor Author

I couldn't actually repro the observed failure, but I'm pretty sure this is what's needed. I can try stressing again tomorrow with and without this patch to confirm. I tried generating splits manually on mixed-version clusters but still couldn't reproduce it, probably because I wasn't creating any abort span bytes in the process. I was trying:

 make buildshort; 
 roachprod destroy local; 
 roachprod create local -n 4; 
 roachprod put local:1-3 (which cockroach20.1) ./cockroach; 
 roachprod put local:4 bin/workload ./workload; 
 roachprod start local:1-3
 roachprod run local:4 './workload run kv --init --splits 10000 --tolerate-errors {pgurl:2}' &
 echo "sleeping..";
 sleep 3;
 roachprod put local:2 ./cockroachshort ./cockroach; 
 roachprod stop local:2;
 roachprod start local:2;

@@ -1036,6 +1036,9 @@ func splitTriggerHelper(
	if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionContainsEstimatesCounter) {
		deltaPostSplitLeft.ContainsEstimates = 0
	}
	if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionAbortSpanBytes) {
Member

Are you sure? If my reading of the callers of this code is correct, deltaPostSplitLeft will essentially be ms, meaning that the existing zeroing out should override it just fine. Don't you want to massage the RHSDelta above? That would make more sense to me as it is used to seed the stats for the RHS, and will not go through the existing check.

@irfansharif
Contributor Author

Looking at the snippet @chrisseto posted:

I200812 19:15:29.100791 4262 kv/kvserver/replica_command.go:414  [n1,s1,r11/1:/Table/1{5-6}] initiating a split of this range at key /Table/15/3 [r997] (manual)
...
E200812 19:16:07.986964 6820 kv/kvserver/replica_consistency.go:162  [n1,consistencyChecker,s1,r997/1:/Table/1{5/3-6}] (n1,s1):1: checksum a67704dbec62d0c7a063e8528bf2cb2e4dc3ce4f83d0b700b16c271f27afe7e52b7533774d9607c33ce5c8d8d7f3ec91cdf7571a3d7d7b0ed197c7df4a0d4214 [minority]

I'm inclined to say this is the patch we're looking for.
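
To illustrate why a single stats field can trip the consistency checker, here is a minimal sketch. It assumes the persisted stats are part of what each replica hashes, and it uses a hypothetical `Stats` struct and `checksum` helper; the real checksum computation covers much more state and is not reproduced here.

```go
package main

import (
	"bytes"
	"crypto/sha512"
	"encoding/binary"
	"fmt"
)

// Stats is a hypothetical, pared-down stand-in for the persisted range stats.
type Stats struct {
	LiveBytes      int64
	AbortSpanBytes int64
}

// checksum hashes a fixed-width encoding of the stats. This is an assumed
// simplification of a replica checksum, not the actual algorithm.
func checksum(s Stats) [sha512.Size]byte {
	var buf bytes.Buffer
	// binary.Write accepts fixed-size structs; Stats only has int64 fields.
	if err := binary.Write(&buf, binary.BigEndian, s); err != nil {
		panic(err)
	}
	return sha512.Sum512(buf.Bytes())
}

func main() {
	// One replica seeded its RHS with a non-zero AbortSpanBytes, while the
	// others ended up with zero for the same range.
	minority := Stats{LiveBytes: 100, AbortSpanBytes: 42}
	majority := Stats{LiveBytes: 100, AbortSpanBytes: 0}

	fmt.Println("checksums match:", checksum(minority) == checksum(majority))
}
```

Under these assumptions the user data is identical on both replicas, yet the checksums differ, which is consistent with a `[minority]` report on the freshly split range.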

Member

@tbg tbg left a comment

Oh, tricky! Good catch. I think you got the wrong side of the split, but this does seem like a problem. I hope you can reproduce before and after as this is all tricky enough that there could be yet another problem.

@irfansharif
Contributor Author

Hm, I think we need both. Something like:

@@ -1036,6 +1033,10 @@ func splitTriggerHelper(
        var pd result.Result
        pd.Replicated.Split = &kvserverpb.Split{
          SplitTrigger: *split,
          // NB: the RHSDelta is identical to the stats for the newly created right
          // hand side range (i.e. it goes from zero to its stats).
          RHSDelta: *h.AbsPostSplitRight(),
        }

        deltaPostSplitLeft := h.DeltaPostSplitLeft()
        if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionContainsEstimatesCounter) {
                deltaPostSplitLeft.ContainsEstimates = 0
        }
+       if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionAbortSpanBytes) {
+               deltaPostSplitLeft.AbortSpanBytes = 0
+               pd.Replicated.Split.RHSDelta.AbortSpanBytes = 0
+       }
        return deltaPostSplitLeft, pd, nil
 }

But yes, I'll have to repro instead of wildly speculating.

@irfansharif irfansharif force-pushed the 200812.deflake-version-upgrade branch 5 times, most recently from a6021c0 to 858a57b Compare August 13, 2020 21:52
Contributor Author

@irfansharif irfansharif left a comment

@tbg Added a (contrived) test that confirms the patch. Happy to not check the test into our repo, though I did try to generalize it a bit.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1039 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Are you sure? If my reading of the callers of this code is correct, deltaPostSplitLeft will essentially be ms, meaning that the existing zeroing out should override it just fine. Don't you want to massage the RHSDelta above? That would make more sense to me as it is used to seed the stats for the RHS, and will not go through the existing check.

Done.

@irfansharif irfansharif requested a review from tbg August 13, 2020 22:29
@irfansharif irfansharif force-pushed the 200812.deflake-version-upgrade branch from 858a57b to 95256b3 Compare August 14, 2020 01:12
Member

@tbg tbg left a comment

Hm, I think we need both.

Interesting, but why? deltaPostSplitLeft is newMS in this code here:

newMS, trigger, err := splitTrigger(
	ctx, rec, batch, *ms, ct.SplitTrigger, txn.WriteTimestamp,
)
*ms = newMS

so we use it as the "main" stats (which makes sense), so its AbortSpanBytes should be zeroed out by the existing check before proposal. Hmm, your code only zeroes the RHS, so you've come around then?
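
The asymmetry between the two sides can be sketched as follows. This is a rough model of the reasoning above, not the real evaluation path: `Stats`, `splitResult`, `splitTrigger`, and `evaluate` here are hypothetical simplifications, and the "existing check" is modeled as a single zeroing step in the caller.

```go
package main

import "fmt"

// Stats is a hypothetical, pared-down stand-in for MVCCStats.
type Stats struct {
	LiveBytes      int64
	AbortSpanBytes int64
}

// splitResult is a hypothetical stand-in for the replicated split payload.
type splitResult struct {
	RHSDelta Stats
}

// splitTrigger returns the LHS stats plus a payload carrying the RHS stats.
// The RHS stats never flow back through the caller's "main" stats, so they
// have to be zeroed here when the version gate is inactive.
func splitTrigger(abortSpanActive bool, lhs, rhs Stats) (Stats, splitResult) {
	res := splitResult{RHSDelta: rhs}
	if !abortSpanActive {
		res.RHSDelta.AbortSpanBytes = 0
	}
	return lhs, res
}

// evaluate models the caller: the LHS stats become the main stats and pass
// through the pre-proposal check, which already zeroes AbortSpanBytes.
func evaluate(abortSpanActive bool, lhs, rhs Stats) (Stats, splitResult) {
	ms, res := splitTrigger(abortSpanActive, lhs, rhs)
	if !abortSpanActive {
		ms.AbortSpanBytes = 0 // modeled existing check before proposal
	}
	return ms, res
}

func main() {
	lhs := Stats{LiveBytes: 60, AbortSpanBytes: 7}
	rhs := Stats{LiveBytes: 40, AbortSpanBytes: 5}
	ms, res := evaluate(false, lhs, rhs)
	fmt.Printf("LHS: %+v RHS: %+v\n", ms, res.RHSDelta)
}
```

In this model, zeroing the LHS inside `splitTrigger` would be redundant, while dropping the RHS zeroing would let a non-zero `AbortSpanBytes` reach the replicas.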

For the test, I wonder if we can "amend" the existing main version-upgrade test instead of adding a new one? I think it's a good idea to have some more generic splitting going on in that one.

Thanks for the fix!

Reviewed 7 of 7 files at r2, 2 of 2 files at r3.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 1034 at r3 (raw file):

	deltaPostSplitLeft := h.DeltaPostSplitLeft()
	if !rec.ClusterSettings().Version.IsActive(ctx, clusterversion.VersionContainsEstimatesCounter) {
		deltaPostSplitLeft.ContainsEstimates = 0

Heh, by that same reasoning, this is not necessary here, but it is necessary on the right. I don't think splits ever come out with ContainsEstimates set on the RHS, though. Too late now anyway. :-)

Contributor Author

@irfansharif irfansharif left a comment

Interesting, but why?

Yeah, I was mistaken. I walked through the codepaths afterwards and realized what you were saying. I'll try seeing what the amended version of version-upgrade looks like and abandon this other test, but I might defer that to later given it's holding up the release. We could probably run workload kv against it in the same way I did here. TFTR!


Fixes cockroachdb#52627. Unskips it as well.

In cockroachdb#50938 we were careful to not transmit non-zero MVCCStats through the
replica proposal codepaths, but there seems to have been another
instance where we end up transmitting MVCCStats across replicas - during
splits. We similarly zero AbortSpanBytes out when VersionAbortSpanBytes
is inactive to avoid replica divergence.

We repro-ed the failure observed in cockroachdb#52627 in this
`splits/mixed-version` roachtest that we introduced in the PR, but are
choosing not to check in. We'll follow up in a future PR by improving
version-upgrade to add splits/workload running against it.

Release note: None
@irfansharif irfansharif force-pushed the 200812.deflake-version-upgrade branch from 95256b3 to 9522fdf Compare August 14, 2020 15:38
@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Aug 14, 2020

🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

@RaduBerinde RaduBerinde removed the request for review from madelineliao August 14, 2020 16:44
@irfansharif
Contributor Author

bors r+

@craig
Contributor

craig bot commented Aug 14, 2020

Build succeeded:

@craig craig bot merged commit 41b82b2 into cockroachdb:master Aug 14, 2020
@irfansharif irfansharif deleted the 200812.deflake-version-upgrade branch August 14, 2020 17:54
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Aug 19, 2020
Fixes cockroachdb#52907. Unskips it as well.

The patch in cockroachdb#52750 was not quite complete. We had zeroed out the
replicated RHS delta appropriately, but forgot to consider the RHS stats
added to the batch then later used to seed RHS state. I re-verified this
patch using the same `splits/mixed-version` test but running for more
iterations.

Release note: None
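
The gap described in this follow-up can be sketched as below, assuming the RHS stats are used in two places: once in the replicated delta and once in the state written to the batch. `Stats`, `batch`, and `splitRHS` are hypothetical names for illustration only; the sketch sanitizes once, before either use, so the two copies cannot disagree.

```go
package main

import "fmt"

// Stats is a hypothetical, pared-down stand-in for MVCCStats.
type Stats struct {
	LiveBytes      int64
	AbortSpanBytes int64
}

// batch is a hypothetical write batch holding the stats that later seed
// the state of each range, keyed by a range name.
type batch map[string]Stats

// splitRHS writes the RHS stats into the batch and returns the replicated
// RHS delta. Zeroing happens once, up front, so both uses see the same value.
func splitRHS(abortSpanActive bool, b batch, rhsName string, rhs Stats) Stats {
	if !abortSpanActive {
		rhs.AbortSpanBytes = 0
	}
	b[rhsName] = rhs // state later used to seed the RHS
	return rhs       // replicated RHS delta
}

func main() {
	b := batch{}
	delta := splitRHS(false, b, "rhs", Stats{LiveBytes: 40, AbortSpanBytes: 5})
	fmt.Printf("delta: %+v seeded: %+v\n", delta, b["rhs"])
}
```

Zeroing only the returned delta, as the earlier patch effectively did per the message above, would leave the batch copy with a non-zero `AbortSpanBytes` in this model.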
craig bot pushed a commit that referenced this pull request Aug 19, 2020
53011: batcheval: deflake version-upgrade (again) r=irfansharif a=irfansharif

Co-authored-by: irfan sharif <[email protected]>
Development

Successfully merging this pull request may close these issues.

roachtest: acceptance/version-upgrade failed consistency check
3 participants