roachtest: replicate/wide failed #54444
The test seems to have timed out because at least one of the ranges seems to have lost quorum.
Seems "real". @tbg, I'm assigning you for now since it seems like something's up with replica removals. I also see ~10 error messages of the following:
EDIT: also adding myself for now since I remembered you're OOO for a bit.
I'm running this test locally now and while I haven't seen that error (haven't tried very hard), I am noticing that this test replicates much slower than I would expect. For example, after decommissioning a node, it takes well over a minute to move 35 ranges, and most of that time is spent not replicating anything. I think I will prioritize making this test snappy, as this will likely highlight a few places in which we're needlessly wasting time. Once the test is fast, we can see if it's still flaky.
I may have spoken too soon. The test upreplicates from one to nine nodes. Each upreplication has two steps (add a learner, promote it to voter). So we're sending, in this test, roughly 35*8 snapshots and performing twice as many replication change transactions. So when the replica count is moving, it seems to do so reasonably fast. And the reason it sometimes stalls is that there's a waiting period for time-until-store-dead in there, too.
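To put those numbers in perspective, here is a back-of-the-envelope sketch (standalone Go, not test code; the constants simply restate the figures from the comment above):

```go
package main

import "fmt"

func main() {
	const (
		ranges        = 35 // ranges being upreplicated in the test
		startReplicas = 1
		endReplicas   = 9
	)
	// Each added replica means one learner snapshot plus two replication
	// change transactions (add the learner, then promote it to voter).
	added := ranges * (endReplicas - startReplicas)
	fmt.Printf("snapshots: ~%d, replication change txns: ~%d\n", added, 2*added)
	// Output: snapshots: ~280, replication change txns: ~560
}
```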
I'm looking at the logs for one of the failures above (or from the same failure on the other branch, not sure which logs they are) but I think I see the problem. In that log, the loss of quorum is like this:
We are in that config. Here's what I think happened. The test first makes sure the range is 9-replicated:
Then it takes down nodes 7-9 and decommissions n7. It waits for the replica count to drop to 7 (avoiding the even 8). Now you would expect to see n7 removed, and ideally another dead replica as well. Instead, here I think we saw something like a removal of two live replicas, neither of which was n7:
At this point, the range is still healthy: four out of its seven replicas are live. The test is also oblivious to this weird state - it sees seven replicas and is happy. I think the explanation for this behavior is that the downreplication did not happen as a result of the decommissioning, but because three replicas were non-live, so the "adaptive replication factor" kicked in and (I think) moved the replication target to 5 (have to check, but I think that's how it works). The test now also explicitly sets a replication factor of five (which, I think, doesn't matter since that's already what the allocator applies), and now we go for the bang: the allocator decides to remove n6. Bad idea, as that leaves only three of the remaining six replicas live, i.e. we lose quorum. This is all rather unfortunate. We already dislike adaptive replication factors (#52528) and this is another instance of them being undesired. However, even without adaptive replication factors, I think the same thing could have happened if the operator changed the replication factor (from 9 to 5) at the right time, like this:
The allocator would consider n7-n9 down but could not remove them until they are marked as dead (i.e. in ~minutes only). In fact, it wouldn't necessarily even consider them for removal. First, since there are no "dead" or decommissioning replicas, it will return this action (cockroach/pkg/kv/kvserver/allocator.go, lines 442 to 450 at ed34965)
and so the replicate queue will go here (cockroach/pkg/kv/kvserver/replicate_queue.go, lines 360 to 361 at c719214)
(note that [...]). We will then hit this path (cockroach/pkg/kv/kvserver/allocator.go, lines 1333 to 1346 at ed34965)
and essentially remove any replica - but remember that n7-n9 are unremovable, because they won't be returned in the store list here (cockroach/pkg/kv/kvserver/allocator.go, line 563 at ed34965).
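For illustration, here is a much-simplified sketch of that filtering step (this is not the actual allocator code; the function name and types are made up): removal candidates are drawn only from live stores, so the down nodes n7-n9 can never be the ones picked, and a live replica is removed instead.

```go
package main

import "fmt"

// candidateForRemoval is a toy version of the filtering described above:
// removal candidates are drawn only from live stores, so replicas on down
// nodes never end up being the ones removed. Names and types are
// illustrative, not CockroachDB's.
func candidateForRemoval(replicas []string, isLive func(string) bool) (string, bool) {
	for _, r := range replicas {
		if isLive(r) { // dead/suspect stores never make it into the store list
			return r, true // the real allocator ranks candidates; we just take the first live one
		}
	}
	return "", false
}

func main() {
	down := map[string]bool{"n7": true, "n8": true, "n9": true}
	isLive := func(n string) bool { return !down[n] }
	victim, _ := candidateForRemoval([]string{"n7", "n8", "n9", "n6"}, isLive)
	fmt.Println(victim) // "n6": a live replica is removed while the dead ones stay put
}
```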
I'm not 100% sure I'm getting everything right here, but there are two main takeaways:
It doesn't seem like there is an "obvious" fix for the issue (though extra waiting should "fix" the test). Since the allocator always operates on potentially stale information, all we can do is make it less likely that this problem occurs. I would argue that the moment the operator sets a replication factor of 5, they are accepting that having more than two nodes down is unacceptable. There is a slight footgun here: if we're setting a lower replication factor in response to an outage (whyever we would do that - it does not seem reasonable), the allocator may unhelpfully cause an actual unavailability. I think we have some "quorum checks" in place somewhere, but they are likely bypassed in this special case where the whole cluster just got restarted (so a lot of ephemeral state may not have repercolated).
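To make the quorum arithmetic behind that unavailability explicit, here is a small standalone sketch (plain Go, not CockroachDB code; since the thread doesn't spell out which two live replicas were removed first, the surviving node names are illustrative). It checks whether a majority of voters is live before and after the removal of n6 described above:

```go
package main

import "fmt"

// hasQuorum reports whether a strict majority of the voters is live.
func hasQuorum(voters []string, live map[string]bool) bool {
	alive := 0
	for _, v := range voters {
		if live[v] {
			alive++
		}
	}
	return alive > len(voters)/2
}

func main() {
	// n7-n9 are down; the remaining live members are illustrative.
	live := map[string]bool{"n1": true, "n2": true, "n3": true, "n6": true}

	before := []string{"n1", "n2", "n3", "n6", "n7", "n8", "n9"} // 7 replicas, 4 live
	after := []string{"n1", "n2", "n3", "n7", "n8", "n9"}        // n6 removed: 6 replicas, 3 live

	fmt.Println(hasQuorum(before, live)) // true: 4 > 3, still available
	fmt.Println(hasQuorum(after, live))  // false: 3 is not > 3, quorum is lost
}
```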
Looking back at the history of this test, it seems that this failure mode is exactly what we were trying to prevent: #34122 (comment)
For what it's worth, @andy-kimball has discussed doing exactly this kind of "downshifting" to avoid sustained fragility under region failure. For instance, imagine a 5-way replication factor, with 2 replicas in us-west, 2 replicas in us-central, and 1 replica in us-east. If us-west goes down, we'd like to avoid a fragile state for the entire duration of the outage. The idea is that we could drop the replication factor down to 3 so that we're able to tolerate a single replica failure during this period. Of course, we won't be able to tolerate the simultaneous failure of a region and one other replica, but that's not the goal. All that goes to say, it doesn't seem particularly unreasonable to lower the replication factor in response to an outage and expect the dead replicas to be removed, though it may be impossible to guarantee such behavior in practice, as you've described.
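As a quick worked example of the fault-tolerance math behind that downshifting idea (a standalone sketch; the region layout is the 2/2/1 example from the comment above):

```go
package main

import "fmt"

// extraFailuresTolerated returns how many additional replica failures a
// range survives, given its replication factor and the number of replicas
// already down.
func extraFailuresTolerated(replicationFactor, alreadyDown int) int {
	quorum := replicationFactor/2 + 1
	live := replicationFactor - alreadyDown
	return live - quorum
}

func main() {
	// RF=5 with the two us-west replicas down: 3 of 5 live, quorum is 3,
	// so no further failure is tolerated for the duration of the outage.
	fmt.Println(extraFailuresTolerated(5, 2)) // 0

	// After downshifting to RF=3 across the surviving regions: 3 of 3 live,
	// quorum is 2, so one more replica failure is tolerated.
	fmt.Println(extraFailuresTolerated(3, 0)) // 1
}
```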
Saw this while investigating cockroachdb#54444. Release note: None
54687: roachtest: log teardown to different file r=andreimatei a=tbg

We perform various operations on teardown which typically produce a few pages of log output (dead nodes, consistency, getting logs, ...). These tend to make the real test failure harder to find, so this commit switches out the loggers before we perform teardown; teardown will log to `teardown.log` instead. The main log (`test.log`) now ends as follows:

```
[...]
09:41:06 allocator.go:315: 0 mis-replicated ranges
09:41:07 test.go:337: test failure: allocator.go:321,test_runner.go:755: boom!
09:41:07 test_runner.go:769: tearing down after failure; see teardown.log
```

Release note: None

54742: kvserver: properly redact the unavailable range message r=knz a=tbg

Saw this while investigating #54444.

Release note: None

55034: cli/flags: remove `doctor` from the list of timeout-supporting commands r=tbg a=knz

Fixes #54931. This was unintentionally added - doctor is not meant to support configurable timeouts (just yet).

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
I ran this on 8f768d ~60 times today, and it never failed.
All 100 runs passed on 485c196 (~master).
Another 100 passed. Uhm?!
And another 100. Feels like I'm messing something up here.
Starting another 100 from a clean slate on 009e4b9, a SHA that we have definitely seen the problem on. If that fails to reproduce, I will try it with the specific version of roachtest/workload/etc from that night as well.
^-- they all passed. Now running another 100 with roachtest from the same SHA.
I realized that in a "cosmetic" refactor of the roachtest, I had changed where we set the replication factor, which explains why I wasn't getting a repro. I did get one with a roachtest SHA preceding that change.
Hmm. Everything looks sound. Going to start another repro cycle with vmodule=1 for allocator and replicate_queue.
Which line are we logging "could not select an appropriate replica to be removed" from? In the [...]?
Weird. Ok, I think I'm chasing down the wrong alley here. I was assuming the replicate queue's overreplication path is doing this demotion that wrecks the group. But it's not! I have verbose logging on and am not seeing this message (cockroach/pkg/kv/kvserver/replicate_queue.go, line 702 at 5ca5b10) for that demotion (but am seeing them for all the earlier expected ones). Someone else is deciding that a replica removal needs to happen.
I started logging a stack trace in [...]. I also did a double take and am now seeing the store rebalancer do something:
That basically must be it. Ok. I have no idea why I had such a hard time reproducing this; I don't think we tampered at all with the store rebalancer knobs, and I don't think we changed anything about how we measure load. Hmm.
Yep, it's the store rebalancer. It's trying to rebalance, but since the range is over-replicated, it ends up removing a random live replica.
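A rough sketch of the mechanism (illustrative only, not the StoreRebalancer's actual code): the rebalancer computes a full set of target stores sized to the zone's replication factor and applies whatever diff gets the range there; if the range currently has more replicas than that, the "rebalance" degenerates into removals, and nothing in this path checks whether a removed replica is one of the few live ones.

```go
package main

import "fmt"

// diffReplicas is a toy model of applying a rebalance target set (sized to
// the zone's replication factor) to a range that currently holds more
// replicas than that: the difference comes out as removals, with no check
// on liveness or quorum. All names are made up for illustration.
func diffReplicas(current, targets []string) (adds, removes []string) {
	inTargets := make(map[string]bool, len(targets))
	for _, t := range targets {
		inTargets[t] = true
	}
	inCurrent := make(map[string]bool, len(current))
	for _, c := range current {
		inCurrent[c] = true
		if !inTargets[c] {
			removes = append(removes, c)
		}
	}
	for _, t := range targets {
		if !inCurrent[t] {
			adds = append(adds, t)
		}
	}
	return adds, removes
}

func main() {
	current := []string{"n1", "n2", "n3", "n6", "n7", "n8", "n9"} // 7 replicas, n7-n9 down
	targets := []string{"n1", "n2", "n3", "n7", "n8"}             // 5 targets; the choice is purely illustrative
	adds, removes := diffReplicas(current, targets)
	fmt.Println(adds, removes) // [] [n6 n9]: the live n6 is removed without any quorum check
}
```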
The "culprit" is in cockroach/pkg/kv/kvserver/store_rebalancer.go, line 504 at ed34965.
I don't know what an easy fix is here. If I add code that says "if you'd change the replication factor, just leave it alone", I think we will still see failures: so far it's been removing a node that's problematic, but a true swap onto a dead node will brick the group just the same. Really we want all of the safety hatches to be applied to any attempt to make replication changes, but since the store rebalancer is essentially "its own thing", this is not easy.
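For context, a minimal sketch of the kind of guard the eventual StoreRebalancer fix (quoted further down) describes; the function and types here are invented for illustration and are not the actual CockroachDB implementation:

```go
// changesReplicaCounts reports whether applying the proposed targets would
// change the number of voters or non-voters. A pure rebalance keeps both
// counts fixed, so a StoreRebalancer-style caller could simply skip the
// range when this returns true instead of accidentally downreplicating.
// Illustrative only.
func changesReplicaCounts(curVoters, curNonVoters, tgtVoters, tgtNonVoters []string) bool {
	return len(curVoters) != len(tgtVoters) || len(curNonVoters) != len(tgtNonVoters)
}
```

As the comment above notes, such a guard only stops accidental downreplication; a genuine swap onto a dead node would need the more general quorum check discussed next.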
This used to live in the replicate queue, but there are other entry points to replication changes, notably the store rebalancer which caused cockroachdb#54444. Move the check into the guts of replication changes, where it is guaranteed to be invoked.

Fixes cockroachdb#50729
Touches cockroachdb#54444 (release-20.2)

Release note (bug fix): in rare situations, an automated replication change could result in a loss of quorum. This would require down nodes and a simultaneous change in the replication factor. Note that a change in the replication factor can occur automatically if the cluster is comprised of fewer than five available nodes. Experimentally the likelihood of encountering this issue, even under contrived conditions, was small.
56735: kvserverpb: move quorum safeguard into execChangeReplicasTxn r=aayushshah15 a=tbg

This used to live in the replicate queue, but there are other entry points to replication changes, notably the store rebalancer which caused #54444. Move the check into the guts of replication changes, where it is guaranteed to be invoked.

Fixes #50729
Touches #54444 (release-20.2)

@aayushshah15 only requesting your review since you're in the area. Feel free to opt out.

Release note (bug fix): in rare situations, an automated replication change could result in a loss of quorum. This would require down nodes and a simultaneous change in the replication factor. Note that a change in the replication factor can occur automatically if the cluster is comprised of fewer than five available nodes. Experimentally the likelihood of encountering this issue, even under contrived conditions, was small.

Co-authored-by: Tobias Grieger <[email protected]>
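Very roughly, the idea of that safeguard is sketched below (a hedged sketch with invented names, not the real execChangeReplicasTxn logic): before committing a replication change, check that the voters in the proposed new config still have a live majority, no matter which component proposed the change.

```go
// wouldLoseQuorum is a toy version of the safeguard described above: reject
// a proposed replication change if the resulting voter set would not have a
// live majority. Names and signature are invented for illustration.
func wouldLoseQuorum(proposedVoters []string, isLive func(string) bool) bool {
	live := 0
	for _, v := range proposedVoters {
		if isLive(v) {
			live++
		}
	}
	return live <= len(proposedVoters)/2
}
```

Because such a check sits in the shared replication-change path rather than in one caller, it covers the replicate queue, the store rebalancer, and any other entry point alike.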
I was entertaining it at the time, but now that that PR is already paged out, I think the risk-reward ratio isn't where it needs to be (though the problem is pretty bad if you do hit it... but it seems that people don't, in practice). I tried the backport and it looks like there are a number of conflicts that would need manual resolution. Possibly they're mostly in the tests, but still, I'm apprehensive.
I'll just close this issue then (was just grooming the backlog as triage oncall).
When the replication factor is lowered and the StoreRebalancer attempts a rebalance, it will accidentally perform a downreplication. Since it wasn't ever supposed to do that, the downreplication is pretty haphazard and doesn't safeguard quorum in the same way that a "proper" downreplication likely would. Prevent it from changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the source - as this commit does - I added a lower-level check that could then not be backported to release-20.2, where we are now seeing this problem.

[knew]: cockroachdb#54444 (comment)

Release note: None
64650: kvserver: prevent StoreRebalancer from downreplicating r=erikgrinaker,nvanbenschoten a=tbg

When the replication factor is lowered and the StoreRebalancer attempts a rebalance, it will accidentally perform a downreplication. Since it wasn't ever supposed to do that, the downreplication is pretty haphazard and doesn't safeguard quorum in the same way that a "proper" downreplication likely would. Prevent it from changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the source - as this commit does - I added a lower-level check that could then not be backported to release-20.2, where we are now seeing this problem.

[knew]: #54444 (comment)

#64649

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
(roachtest).replicate/wide failed on release-20.2@009e4b919f04f6e324731f87b4cb6a11f414553d:
Artifacts: /replicate/wide
See this test on roachdash
powered by pkg/cmd/internal/issues