
roachtest: replicate/wide failed #54444

Closed
cockroach-teamcity opened this issue Sep 16, 2020 · 30 comments

@cockroach-teamcity

(roachtest).replicate/wide failed on release-20.2@009e4b919f04f6e324731f87b4cb6a11f414553d:

The test failed on branch=release-20.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)


Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 16, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Sep 16, 2020

aayushshah15 commented Sep 17, 2020

The test seems to have timed out because at least one of the ranges seems to have lost quorum.

85532 kv/kvserver/queue.go:1202 ⋮ [n5,replicate] 1 replicas failing with ‹"range requires a replication change, but live replicas [(n5,s5):4 (n2,s2):11 (n1,s1):13] don't constitute a quorum for [(n4,s4):10VOTER_DEMOTING (n8,s8):14 (n7,s7):7 (n5,s5):4 (n9,s9):9 (n2,s2):11 (n1,s1):13]:"›

Seems "real", @tbg I'm assigning you for now since it seems like something's up with replica removals.

I also see ~10 error messages of the following:

E200916 11:27:48.836884 1256786 kv/kvserver/queue.go:1087 ⋮ [n4,replicaGC,s4,r6/10:‹/Table/{SystemCon…-11}›] ‹operation "replicaGC queue process replica 6" timed out after 1m0s›

EDIT: also adding myself for now since I remembered you're OOO for a bit


tbg commented Sep 23, 2020

I'm running this test locally now, and while I haven't seen that error (haven't tried very hard), I am noticing that this test replicates much more slowly than I would expect. For example, after decommissioning a node, it takes well over a minute to move 35 ranges, and most of that time is spent not replicating anything. I think I will prioritize making this test snappier, as this will likely highlight a few places in which we're needlessly wasting time. Once the test is fast, we can see if it's still flaky.


tbg commented Sep 23, 2020

I may have spoken too soon. The test upreplicates from one to nine nodes. Each upreplication has two steps (add a learner, promote to voter). So we're sending, in this test, roughly 35*8 snapshots and performing twice as many replication change transactions. So when the replica count is moving, it seems to do so reasonably fast. And the reason it sometimes stalls is that there's a time-until-store-dead waiting period in there, too.
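
Back-of-the-envelope (my numbers, assuming the test's 35 ranges each go from one replica to nine):

```
package main

import "fmt"

func main() {
	const ranges = 35           // ranges the test up-replicates
	const addedPerRange = 9 - 1 // each range goes from one replica to nine

	// One learner snapshot per added replica, and two config-change
	// transactions per addition (add learner, then promote to voter).
	snapshots := ranges * addedPerRange
	changeTxns := 2 * snapshots

	fmt.Printf("snapshots: %d, replication change txns: %d\n", snapshots, changeTxns)
	// snapshots: 280, replication change txns: 560
}
```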


tbg commented Sep 24, 2020

I'm looking at the logs for one of the failures above (or from the same failure on the other branch, not sure which logs they are) but I think I see the problem. In that log, the loss of quorum is like this:

1 3VOTER_DEMOTING 5 6 down[7 8 9]

We are in the config (1 5 6 7 8 9) && (1 5 6 7 8 9), i.e. to make progress both groups need to reach quorum. But the first group (1 5 6 7 8 9) is unavailable since 7,8,9 are down. We see the same in Aayush's comment above.
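
To make the quorum math concrete, here is a toy sketch (not CockroachDB code) of why that joint configuration is stuck:

```
package main

import "fmt"

// majorityLive reports whether a strict majority of the given voter set is live.
func majorityLive(voters []int, live map[int]bool) bool {
	n := 0
	for _, v := range voters {
		if live[v] {
			n++
		}
	}
	return n >= len(voters)/2+1
}

func main() {
	// After the restart, n7-n9 stay down; the other replicas are live.
	live := map[int]bool{1: true, 3: true, 5: true, 6: true}

	// A joint config commits only if *each* half has a live majority.
	// The half shown above, (1 5 6 7 8 9), has 3 of 6 voters live: no quorum.
	group := []int{1, 5, 6, 7, 8, 9}
	fmt.Println(majorityLive(group, live)) // false
}
```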

Here's what I think happened. The test first makes sure the range is 9-replicated:

(1 2 3 4 5 6 7 8 9)

Then it takes down nodes 7-9 and decommissions n7. It waits for the replication factor to drop to 7 (avoiding the even 8).

Now you would expect to see n7 removed, and ideally another dead replica. Instead, here I think we saw something like a removal of two live replicas, none of which were 7:

(1 _ 3 _ 5 6 7 8 9)

At this point, the range is still healthy: it has four out of seven replicas live. The test is also oblivious to this weird state - it sees seven replicas and is happy.

I think the explanation for this behavior is that the downreplication did not happen as a result of the decommissioning, but because three replicas were non-live, so the "adaptive replication factor" kicked in and (I think) moved the replication target to 5 (have to check, but I think that's how it works). The test now also explicitly sets a replication factor of five (which, I think, doesn't matter since that's already what the allocator applies), and now we go for the bang: the allocator decides to remove n6. Bad idea, as that puts the new config at (1 _ 3 _ 5 _ 7 8 9), which has three of six replicas down and is unable to make progress.

Now, why does it remove n6? I can only assume that due to the full-cluster restart this test performs, it was a little confused about who's live and who is not. The test has a comment that gives another clue: the allocator can't remove a node for which it has no gossiped store descriptor until that node is considered "dead" (i.e. down for a long time), and n7-n9 have no gossiped store descriptor due to the crash.

This is all rather unfortunate. We already dislike adaptive replication factors (#52528), and this is another instance of them being undesirable.
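
For reference, roughly how I understand the adaptive replication factor to clamp the target (a sketch modeled loosely on GetNeededReplicas; treat the details, and the meaning of clusterNodes, as approximations):

```
package main

import "fmt"

// neededReplicas sketches the adaptive clamping: never target more replicas
// than there are usable nodes, and avoid even counts. clusterNodes is assumed
// to be the node count the allocator currently considers usable.
func neededReplicas(zoneReplicas, clusterNodes int) int {
	need := zoneReplicas
	if clusterNodes < need {
		need = clusterNodes // never target more replicas than nodes
	}
	if need%2 == 0 {
		need-- // avoid even replication factors (fragile quorum)
	}
	if need < 3 {
		need = 3
	}
	if need > zoneReplicas {
		need = zoneReplicas
	}
	return need
}

func main() {
	// Zone config still says 9, but only 6 nodes look usable after the
	// restart: min(9, 6) = 6, then 6 -> 5 to stay odd.
	fmt.Println(neededReplicas(9, 6)) // 5
}
```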

However, without adaptive replication factors, I think the same thing could have happened if the operator changed the replication factor (from 9 to 5) at the right time, like this:

  • replicate to 9 nodes
  • restart cluster, bring only nodes 1-6 back (as the test does)
  • set replication factor to 5

The allocator would consider n7-n9 down but could not remove them until they are marked as dead (i.e. only after several minutes). In fact, it wouldn't necessarily even consider them for removal. First, since there are no "dead" or decommissioning replicas, it will return this action:

if have > need {
	// Range is over-replicated, and should remove a replica.
	// Ranges with an even number of replicas get extra priority because
	// they have a more fragile quorum.
	priority := removeExtraReplicaPriority - float64(have%2)
	action := AllocatorRemove
	log.VEventf(ctx, 3, "%s - need=%d, have=%d, priority=%.2f", action, need, have, priority)
	return action, priority
}

and so the replicate queue will go here

case AllocatorRemove:
	return rq.remove(ctx, repl, voterReplicas, dryRun)

(note that voterReplicas here is all voters, not just live or dead ones - it's all nodes in our example)

We will then hit this path

if len(upToDateReplicas) > newQuorum {
	// The number of up-to-date replicas is larger than the new quorum. Any
	// replica can be removed, though we want to filter out brandNewReplicaID.
	if brandNewReplicaID != 0 {
		candidates := make([]roachpb.ReplicaDescriptor, 0, len(replicas)-len(upToDateReplicas))
		for _, r := range replicas {
			if r.ReplicaID != brandNewReplicaID {
				candidates = append(candidates, r)
			}
		}
		return candidates
	}
	return replicas
}

and essentially remove any replica - but remember that n7-n9 are unremovable, because they won't be returned in the store list here:

sl, _, _ := a.storePool.getStoreListFromIDs(existingStoreIDs, storeFilterNone)

I'm not 100% sure I'm getting everything right, but there are two main takeaways here:

  1. the inability to downreplicate off "ungossiped" stores makes it more likely that the allocator shoots itself in the foot
  2. even if we could downreplicate off ungossiped stores, the allocator might still shoot itself in the foot

It doesn't seem like there is an "obvious" fix for the issue (though extra waiting should "fix" the test). Since the allocator always operates on potentially stale information, all we can do is make it less likely that this problem occurs. I would argue that the moment the operator sets a replication factor of 5, they are accepting that having more than two nodes down is unacceptable. There is a slight footgun here - if we're setting a lower replication factor in response to an outage (whyever we would do that - does not seem reasonable) the allocator may unhelpfully cause actual unavailability. I think we have some "quorum checks" in place somewhere, but they are likely bypassed in this special case where the whole cluster just got restarted (so a lot of ephemeral state may not have repercolated).
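
For illustration, the kind of quorum safeguard I have in mind looks roughly like this (a sketch only, not the actual check; note that it depends on liveness info that can itself be stale right after a full-cluster restart):

```
package main

import "fmt"

// wouldLoseQuorum sketches the safeguard: reject a replication change if the
// voters that remain afterwards would not have a live majority.
func wouldLoseQuorum(remainingVoters []int, live map[int]bool) bool {
	n := 0
	for _, v := range remainingVoters {
		if live[v] {
			n++
		}
	}
	return n < len(remainingVoters)/2+1
}

func main() {
	live := map[int]bool{1: true, 3: true, 5: true, 6: true} // n7-n9 down

	// Removing n6 from (1 3 5 6 7 8 9) leaves (1 3 5 7 8 9):
	// only 3 of 6 voters live, so the change should be refused.
	fmt.Println(wouldLoseQuorum([]int{1, 3, 5, 7, 8, 9}, live)) // true
}
```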


tbg commented Sep 24, 2020

Looking back at the history of this test, it seems that this failure mode is exactly what we were trying to prevent: #34122 (comment)
I'll have to dig into these changes to figure out if we're just hitting a rarer version of that failure mode now or whether we've regressed. Given that we see this on 20.2 as well, I would guess we have regressed. Fun.

@nvanbenschoten

> if we're setting a lower replication factor in response to an outage (whyever we would do that - does not seem reasonable)

For what it's worth, @andy-kimball has discussed doing exactly this kind of "downshifting" to avoid sustained fragility under region failure. For instance, imagine a 5-way replication factor, with 2 replicas in us-west, 2 replicas in us-central, and 1 replica in us-east. If us-west goes down, we'd like to avoid a fragile state for the entire duration of the outage. The idea is that we could drop the replication factor down to 3 so that we're able to tolerate a single replica failure during this period. Of course, we won't be able to tolerate the simultaneous failure of a region and one other replica, but that's not the goal.

All that goes to say, it doesn't seem particularly unreasonable to lower the replication factor in response to an outage and expect the dead replicas to be removed, though it may be impossible to guarantee such behavior in practice, as you've described.
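
To put rough numbers on that fragility (a toy quorum calculation, not CockroachDB code):

```
package main

import "fmt"

// furtherFailuresTolerated: how many more replicas can fail before a range
// with this replication factor and this many live replicas loses quorum.
func furtherFailuresTolerated(replicationFactor, liveReplicas int) int {
	quorum := replicationFactor/2 + 1
	return liveReplicas - quorum
}

func main() {
	// RF=5 with us-west's 2 replicas down: 3 live, quorum 3 -> fragile.
	fmt.Println(furtherFailuresTolerated(5, 3)) // 0

	// Downshift to RF=3 once the dead replicas are removed: 3 live, quorum 2.
	fmt.Println(furtherFailuresTolerated(3, 3)) // 1
}
```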


tbg commented Sep 29, 2020

Looking more into the root cause. We saw this first on June 27 here. I looked at the closed issues, and basically this never flaked after April 2019, when we fixed a flake. So it must be a regression. I'll try a repro at the oldest SHA, 8f768ad.

tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2020
Saw this while investigating cockroachdb#54444.

Release note: None
craig bot pushed a commit that referenced this issue Sep 30, 2020
54687: roachtest: log teardown to different file r=andreimatei a=tbg

We perform various operations on teardown which typically produce a few
pages of log output (dead nodes, consistency, getting logs, ...).
These tend to make the real test failure harder to find, so this
commit switches out the loggers before we perform teardown; teardown
will log to `teardown.log` instead. The main log (`test.log`) now
ends as follows:

```
[...]
09:41:06 allocator.go:315: 0 mis-replicated ranges
09:41:07 test.go:337: test failure: 	allocator.go:321,test_runner.go:755: boom!
09:41:07 test_runner.go:769: tearing down after failure; see teardown.log
```

Release note: None


54742: kvserver: properly redact the unavailable range message r=knz a=tbg

Saw this while investigating #54444.

Release note: None

55034: cli/flags: remove `doctor` from the list of timeout-supporting commands r=tbg a=knz

Fixes #54931.

This was unintentionally added - doctor is not meant to support
configurable timeouts (just yet).

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>

tbg commented Sep 30, 2020

I ran this on 8f768ad ~60 times today, and it never failed.


tbg commented Sep 30, 2020

All 100 runs passed on 485c196 (~master).


tbg commented Oct 1, 2020

Another 100 passed. Uhm?!


tbg commented Oct 1, 2020

And another 100. Feels like I'm messing something up here.


tbg commented Oct 6, 2020

Starting another 100 from a clean slate on 009e4b9, a SHA that we have definitely seen the problem on. If that fails to reproduce, I will try it with the specific version of roachtest/workload/etc from that night as well.


tbg commented Oct 6, 2020

^-- they all passed. Now running another 100 with roachtest from the same SHA.


tbg commented Oct 6, 2020

I realized that in a "cosmetic" refactor of the roachtest, I had changed where we set the replication factor, which explains why I wasn't getting a repro. I did get one with a roachtest SHA preceding that change.
I also looked at the code - we didn't change anything here. I will instrument this tomorrow to learn where things go sideways and will post a summary.


tbg commented Oct 9, 2020

Hmm. Everything looks sound. Going to start another repro cycle with vmodule=1 for allocator and replicate_queue.

@nvanbenschoten

Which line are we logging "could not select an appropriate replica to be removed" from? In the replicateQueue? I don't see how we could log that error and then continue with a removal either. Are you sure we're performing the removal in the same pass through the allocator?


tbg commented Oct 13, 2020

Weird. Ok, I think I'm chasing down the wrong alley here. I was assuming the replicate queue overreplication path is doing this demotion that wrecks the group. But it's not! I have verbose logging on and am not seeing this message

log.VEventf(ctx, 1, "removing replica %+v due to over-replication: %s",

for that demotion (but am seeing them for all the earlier expected ones).

Someone else is deciding that a replica removal needs to happen.


tbg commented Oct 13, 2020

I started logging a stack trace in execChangeReplicasCmd and am seeing that it's coming in via an AdminChangeReplicas command.

I also did a double take and am seeing the store rebalancer do something:

tobias-1602581821-05-n9cpu1-0003> I201013 09:48:29.541848 107 kv/kvserver/store_rebalancer.go:284 ⋮ [n3,s3,store-rebalancer] ran out of leases worth transferring and qps (174.30) is still above desired threshold (149.70); considering load-based replica rebalances

That basically must be it. Ok. I have no idea why I had such a hard time reproducing this; I don't think we tampered at all with the store rebalancer knobs, and I don't think we changed anything about how we measure load. Hmm.
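
The decision in that log line boils down to something like this (an illustrative sketch; the names and shape are made up, not the real store_rebalancer.go code):

```
package main

import "fmt"

// shouldTryReplicaRebalance sketches the decision in that log line: once
// there are no leases left worth transferring and the store's QPS is still
// above the target, fall back to load-based replica rebalances.
func shouldTryReplicaRebalance(leasesWorthTransferring int, storeQPS, targetQPS float64) bool {
	return leasesWorthTransferring == 0 && storeQPS > targetQPS
}

func main() {
	fmt.Println(shouldTryReplicaRebalance(0, 174.30, 149.70)) // true, as on n3 above
}
```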


tbg commented Oct 13, 2020

Yep, it's the store rebalancer. It's trying to rebalance, but since the range is over-replicated, it ends up removing a random live replica.

kv/kvserver/store_rebalancer.go:309 ⋮ [n5,s5,store-rebalancer] rebalancing r6 (81.65 qps) from ‹(n4,s4):4,(n5,s5):13,(n3,s3):3,(n7,s7):8,(n8,s8):9,(n9,s9):10,(n6,s6):11› to ‹[n4,s4 n3,s3 n7,s7 n8,s8 n9,s9 n6,s6]› to better balance load


tbg commented Oct 13, 2020

The "culprit" is that in chooseReplicaToRebalance, we don't "just" swap out a replica for another (though that would brick the cluster just the same), but really do an ad-hoc replicate queue like run that takes into account the desired target replication factor:

desiredReplicas := GetNeededReplicas(*zone.NumReplicas, clusterNodes)

I don't know what an easy fix here is. If I add code that says "if you'd change the replication factor, just leave it alone", I think we will still see failures: so far the problem has been the removal of a live replica, but a true swap onto a dead node would brick the group just the same. Really we want all of the safety hatches to apply to any attempt to make a replication change, but since the store rebalancer is essentially "its own thing", this is not easy.
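
For concreteness, that "leave the replication factor alone" guard would look roughly like this (hypothetical code, not the actual store rebalancer; and as noted, it would not catch a same-size swap onto a dead store):

```
package main

import "fmt"

// countsUnchanged sketches the guard: a load-based rebalance may swap which
// stores hold the replicas, but it must not change how many voters (or
// non-voters) the range ends up with.
func countsUnchanged(curVoters, targetVoters, curNonVoters, targetNonVoters []int) bool {
	return len(curVoters) == len(targetVoters) &&
		len(curNonVoters) == len(targetNonVoters)
}

func main() {
	// The rebalance from the log above: seven current replicas, six targets.
	cur := []int{4, 5, 3, 7, 8, 9, 6}
	target := []int{4, 3, 7, 8, 9, 6}
	fmt.Println(countsUnchanged(cur, target, nil, nil)) // false: don't do it
}
```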

tbg added a commit to tbg/cockroach that referenced this issue Nov 25, 2020
This used to live in the replicate queue, but there are other
entry points to replication changes, notably the store rebalancer
which caused cockroachdb#54444.

Move the check into the guts of replication changes, where it is
guaranteed to be invoked.

Fixes cockroachdb#50729
Touches cockroachdb#54444 (release-20.2)

Release note (bug fix): in rare situations, an automated replication
change could result in a loss of quorum. This would require down nodes
and a simultaneous change in the replication factor. Note that a change
in the replication factor can occur automatically if the cluster is
comprised of less than five available nodes. Experimentally the likelihood
of encountering this issue, even under contrived conditions, was small.
tbg added a commit to tbg/cockroach that referenced this issue Dec 3, 2020
This used to live in the replicate queue, but there are other
entry points to replication changes, notably the store rebalancer
which caused cockroachdb#54444.

Move the check into the guts of replication changes, where it is
guaranteed to be invoked.

Fixes cockroachdb#50729
Touches cockroachdb#54444 (release-20.2)

Release note (bug fix): in rare situations, an automated replication
change could result in a loss of quorum. This would require down nodes
and a simultaneous change in the replication factor. Note that a change
in the replication factor can occur automatically if the cluster is
comprised of less than five available nodes. Experimentally the likelihood
of encountering this issue, even under contrived conditions, was small.
craig bot pushed a commit that referenced this issue Dec 3, 2020
56735: kvserverpb: move quorum safeguard into execChangeReplicasTxn r=aayushshah15 a=tbg

This used to live in the replicate queue, but there are other
entry points to replication changes, notably the store rebalancer
which caused #54444.

Move the check into the guts of replication changes, where it is
guaranteed to be invoked.

Fixes #50729
Touches #54444 (release-20.2)

@aayushshah15 only requesting your review since you're in the area.
Feel free to opt out.

Release note (bug fix): in rare situations, an automated replication
change could result in a loss of quorum. This would require down nodes
and a simultaneous change in the replication factor. Note that a change
in the replication factor can occur automatically if the cluster is
comprised of less than five available nodes. Experimentally the likelihood
of encountering this issue, even under contrived conditions, was small.


Co-authored-by: Tobias Grieger <[email protected]>
@irfansharif

@tbg: Did we mean to backport #56735 to 20.2?


tbg commented Feb 3, 2021

I was entertaining it at the time, but now that that PR is already paged out, I think the risk-reward ratio isn't where it needs to be (though the problem is pretty bad if you do hit it... but it seems that people don't, in practice). I tried the backport, and it looks like there are a number of conflicts that would need manual resolution. Possibly they're mostly in the tests, but still, I'm apprehensive.

@irfansharif

I'll just close this issue then (was just grooming the backlog as triage oncall).

tbg added a commit to tbg/cockroach that referenced this issue May 4, 2021
When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: cockroachdb#54444 (comment)

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 4, 2021
When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: cockroachdb#54444 (comment)

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 20, 2021
When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: cockroachdb#54444 (comment)

Release note: None
tbg added a commit that referenced this issue May 20, 2021
When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: #54444 (comment)

Release note: None
craig bot pushed a commit that referenced this issue May 20, 2021
64650: kvserver: prevent StoreRebalancer from downreplicating r=erikgrinaker,nvanbenschoten a=tbg

When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: #54444 (comment)

#64649

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>