
roachtest: replicate/wide failed #54444

Closed
cockroach-teamcity opened this issue Sep 16, 2020 · 30 comments

@cockroach-teamcity

(roachtest).replicate/wide failed on release-20.2@009e4b919f04f6e324731f87b4cb6a11f414553d:

The test failed on branch=release-20.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicate/wide/run_1
	test_runner.go:801: test timed out (10m0s)


Artifacts: /replicate/wide

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 16, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Sep 16, 2020

aayushshah15 commented Sep 17, 2020

The test seems to have timed out because at least one of the ranges seems to have lost quorum.

85532 kv/kvserver/queue.go:1202 ⋮ [n5,replicate] 1 replicas failing with ‹"range requires a replication change, but live replicas [(n5,s5):4 (n2,s2):11 (n1,s1):13] don't constitute a quorum for [(n4,s4):10VOTER_DEMOTING (n8,s8):14 (n7,s7):7 (n5,s5):4 (n9,s9):9 (n2,s2):11 (n1,s1):13]:"›

Seems "real", @tbg I'm assigning you for now since it seems like something's up with replica removals.

I also see ~10 error messages of the following:

E200916 11:27:48.836884 1256786 kv/kvserver/queue.go:1087 ⋮ [n4,replicaGC,s4,r6/10:‹/Table/{SystemCon…-11}›] ‹operation "replicaGC queue process replica 6" timed out after 1m0s›

EDIT: also adding myself for now since I remembered you're OOO for a bit


tbg commented Sep 23, 2020

I'm running this test locally now, and while I haven't seen that error (haven't tried very hard), I am noticing that this test replicates much more slowly than I would expect. For example, after decommissioning a node, it takes well over a minute to move 35 ranges, and most of that time is spent not replicating anything. I think I will prioritize making this test snappier, as this will likely highlight a few places in which we're needlessly wasting time. Once the test is fast, we can see if it's still flaky.


tbg commented Sep 23, 2020

I may have spoken too soon. The test upreplicates from one to nine nodes. Each upreplication has two steps (add a learner, promote to voter). So we're sending, in this test, roughly 35*8 snapshots and performing twice as many replication change transactions. So when the replica count is moving, it seems to do so reasonably fast. And the reason it sometimes stalls is that there's a time-until-store-dead waiting period in there, too.
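
Back-of-the-envelope (my numbers, assuming the test's 35 ranges each go from one replica to nine):

```
package main

import "fmt"

func main() {
	const ranges = 35           // ranges the test up-replicates
	const addedPerRange = 9 - 1 // each range goes from one replica to nine

	// One learner snapshot per added replica, and two config-change
	// transactions per addition (add learner, then promote to voter).
	snapshots := ranges * addedPerRange
	changeTxns := 2 * snapshots

	fmt.Printf("snapshots: %d, replication change txns: %d\n", snapshots, changeTxns)
	// snapshots: 280, replication change txns: 560
}
```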


tbg commented Sep 24, 2020

I'm looking at the logs for one of the failures above (or from the same failure on the other branch, not sure which logs they are) but I think I see the problem. In that log, the loss of quorum is like this:

1 3VOTER_DEMOTING 5 6 down[7 8 9]

We are in the config (1 5 6 7 8 9) && (1 5 6 7 8 9), i.e. to make progress both groups need to reach quorum. But the first group (1 5 6 7 8 9) is unavailable since 7,8,9 are down. We see the same in Aayush's comment above.
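
To make the quorum math concrete, here is a toy sketch (not CockroachDB code) of why that joint configuration is stuck:

```
package main

import "fmt"

// majorityLive reports whether a strict majority of the given voter set is live.
func majorityLive(voters []int, live map[int]bool) bool {
	n := 0
	for _, v := range voters {
		if live[v] {
			n++
		}
	}
	return n >= len(voters)/2+1
}

func main() {
	// After the restart, n7-n9 stay down; the other replicas are live.
	live := map[int]bool{1: true, 3: true, 5: true, 6: true}

	// A joint config commits only if *each* half has a live majority.
	// The half shown above, (1 5 6 7 8 9), has 3 of 6 voters live: no quorum.
	group := []int{1, 5, 6, 7, 8, 9}
	fmt.Println(majorityLive(group, live)) // false
}
```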

Here's what I think happened. The test first makes sure the range is 9-replicated:

(1 2 3 4 5 6 7 8 9)

Then it takes down nodes 7-9 and decommissions n7. It waits for the replication factor to drop to 7 (avoiding the even 8).

Now you would expect to see n7 removed, and ideally another dead replica. Instead, here I think we saw something like a removal of two live replicas, none of which were 7:

(1 _ 3 _ 5 6 7 8 9)

At this point, the range is still healthy: it has four out of seven replicas live. The test is also oblivious to this weird state - it sees seven replicas and is happy.

I think the explanation for this behavior is that the downreplication did not happen as a result of the decommissioning, but because three replicas were non-live, so the "adaptive replication factor" kicked in and (I think) moved the replication target to 5 (have to check, but I think that's how it works). The test now also explicitly sets a replication factor of five (which, I think, doesn't matter since that's already what the allocator applies), and now we go for the bang: the allocator decides to remove n6. Bad idea, as that puts the new config at (1 _ 3 _ 5 _ 7 8 9), which has three of six replicas down and is unable to make progress.

Now, why does it remove n6? I can only assume that due to the full-cluster restart this test performs, it was a little confused about who's live and who is not. The test has a comment that gives another clue: the allocator can't remove a node for which it has no gossiped store descriptor until that node is considered "dead" (i.e. down for a long time), and n7-n9 have no gossiped store descriptor due to the crash.

This is all rather unfortunate. We already dislike adaptive replication factors (#52528), and this is another instance of them being undesirable.
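
For reference, roughly how I understand the adaptive replication factor to clamp the target (a sketch modeled loosely on GetNeededReplicas; treat the details, and the meaning of clusterNodes, as approximations):

```
package main

import "fmt"

// neededReplicas sketches the adaptive clamping: never target more replicas
// than there are usable nodes, and avoid even counts. clusterNodes is assumed
// to be the node count the allocator currently considers usable.
func neededReplicas(zoneReplicas, clusterNodes int) int {
	need := zoneReplicas
	if clusterNodes < need {
		need = clusterNodes // never target more replicas than nodes
	}
	if need%2 == 0 {
		need-- // avoid even replication factors (fragile quorum)
	}
	if need < 3 {
		need = 3
	}
	if need > zoneReplicas {
		need = zoneReplicas
	}
	return need
}

func main() {
	// Zone config still says 9, but only 6 nodes look usable after the
	// restart: min(9, 6) = 6, then 6 -> 5 to stay odd.
	fmt.Println(neededReplicas(9, 6)) // 5
}
```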

However, without adaptive replication factors, I think the same thing could have happened if the operator changed the replication factor (from 9 to 5) at the right time, like this:

  • replicate to 9 nodes
  • restart cluster, bring only nodes 1-6 back (as the test does)
  • set replication factor to 5

The allocator would consider n7-n9 down but could not remove them until they are marked as dead (i.e. only after several minutes). In fact, it wouldn't necessarily even consider them for removal. First, since there are no "dead" or decommissioning replicas, it will return this action:

if have > need {
	// Range is over-replicated, and should remove a replica.
	// Ranges with an even number of replicas get extra priority because
	// they have a more fragile quorum.
	priority := removeExtraReplicaPriority - float64(have%2)
	action := AllocatorRemove
	log.VEventf(ctx, 3, "%s - need=%d, have=%d, priority=%.2f", action, need, have, priority)
	return action, priority
}

and so the replicate queue will go here

case AllocatorRemove:
	return rq.remove(ctx, repl, voterReplicas, dryRun)

(note that voterReplicas here is all voters, not just live or dead ones - it's all nodes in our example)

We will then hit this path

if len(upToDateReplicas) > newQuorum {
	// The number of up-to-date replicas is larger than the new quorum. Any
	// replica can be removed, though we want to filter out brandNewReplicaID.
	if brandNewReplicaID != 0 {
		candidates := make([]roachpb.ReplicaDescriptor, 0, len(replicas)-len(upToDateReplicas))
		for _, r := range replicas {
			if r.ReplicaID != brandNewReplicaID {
				candidates = append(candidates, r)
			}
		}
		return candidates
	}
	return replicas
}

and essentially remove any replica - but remember that n7-n9 are unremovable, because they won't be returned in the store list here:

sl, _, _ := a.storePool.getStoreListFromIDs(existingStoreIDs, storeFilterNone)

I'm not 100% sure I'm getting everything right, but there are two main takeaways here:

  1. the inability to downreplicate off "ungossiped" stores makes it more likely that the allocator shoots itself in the foot
  2. even if we could downreplicate off ungossiped stores, the allocator might still shoot itself in the foot

It doesn't seem like there is an "obvious" fix for the issue (though extra waiting should "fix" the test). Since the allocator always operates on potentially stale information, all we can do is make it less likely that this problem occurs. I would argue that the moment the operator sets a replication factor of 5, they are accepting that having more than two nodes down is unacceptable. There is a slight footgun here - if we're setting a lower replication factor in response to an outage (whyever we would do that - does not seem reasonable) the allocator may unhelpfully cause actual unavailability. I think we have some "quorum checks" in place somewhere, but they are likely bypassed in this special case where the whole cluster just got restarted (so a lot of ephemeral state may not have repercolated).
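
For illustration, the kind of quorum safeguard I have in mind looks roughly like this (a sketch only, not the actual check; note that it depends on liveness info that can itself be stale right after a full-cluster restart):

```
package main

import "fmt"

// wouldLoseQuorum sketches the safeguard: reject a replication change if the
// voters that remain afterwards would not have a live majority.
func wouldLoseQuorum(remainingVoters []int, live map[int]bool) bool {
	n := 0
	for _, v := range remainingVoters {
		if live[v] {
			n++
		}
	}
	return n < len(remainingVoters)/2+1
}

func main() {
	live := map[int]bool{1: true, 3: true, 5: true, 6: true} // n7-n9 down

	// Removing n6 from (1 3 5 6 7 8 9) leaves (1 3 5 7 8 9):
	// only 3 of 6 voters live, so the change should be refused.
	fmt.Println(wouldLoseQuorum([]int{1, 3, 5, 7, 8, 9}, live)) // true
}
```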


tbg commented Sep 24, 2020

Looking back at the history of this test, it seems that this failure mode is exactly what we were trying to prevent: #34122 (comment)
I'll have to dig into these changes to figure out if we're just hitting a rarer version of that failure mode now or whether we've regressed. Given that we see this on 20.2 as well, I would guess we have regressed. Fun.

@nvanbenschoten

> if we're setting a lower replication factor in response to an outage (whyever we would do that - does not seem reasonable)

For what it's worth, @andy-kimball has discussed doing exactly this kind of "downshifting" to avoid sustained fragility under region failure. For instance, imagine a 5-way replication factor, with 2 replicas in us-west, 2 replicas in us-central, and 1 replica in us-east. If us-west goes down, we'd like to avoid a fragile state for the entire duration of the outage. The idea is that we could drop the replication factor down to 3 so that we're able to tolerate a single replica failure during this period. Of course, we won't be able to tolerate the simultaneous failure of a region and one other replica, but that's not the goal.

All that goes to say, it doesn't seem particularly unreasonable to lower the replication factor in response to an outage and expect the dead replicas to be removed, though it may be impossible to guarantee such behavior in practice, as you've described.
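
To put rough numbers on that fragility (a toy quorum calculation, not CockroachDB code):

```
package main

import "fmt"

// furtherFailuresTolerated: how many more replicas can fail before a range
// with this replication factor and this many live replicas loses quorum.
func furtherFailuresTolerated(replicationFactor, liveReplicas int) int {
	quorum := replicationFactor/2 + 1
	return liveReplicas - quorum
}

func main() {
	// RF=5 with us-west's 2 replicas down: 3 live, quorum 3 -> fragile.
	fmt.Println(furtherFailuresTolerated(5, 3)) // 0

	// Downshift to RF=3 once the dead replicas are removed: 3 live, quorum 2.
	fmt.Println(furtherFailuresTolerated(3, 3)) // 1
}
```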


tbg commented Sep 29, 2020

Looking more into the root cause. We saw this first on June 27 here. I looked at the closed issues, and basically this never flaked after April 2019, when we fixed a flake. So it must be a regression. I'll try a repro at the oldest SHA, 8f768ad.

tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2020
Saw this while investigating cockroachdb#54444.

Release note: None
craig bot pushed a commit that referenced this issue Sep 30, 2020
54687: roachtest: log teardown to different file r=andreimatei a=tbg

We perform various operations on teardown which typically produce a few
pages of log output (dead nodes, consistency, getting logs, ...).
These tend to make the real test failure harder to find, so this
commit switches out the loggers before we perform teardown; teardown
will log to `teardown.log` instead. The main log (`test.log`) now
ends as follows:

```
[...]
09:41:06 allocator.go:315: 0 mis-replicated ranges
09:41:07 test.go:337: test failure: 	allocator.go:321,test_runner.go:755: boom!
09:41:07 test_runner.go:769: tearing down after failure; see teardown.log
```

Release note: None


54742: kvserver: properly redact the unavailable range message r=knz a=tbg

Saw this while investigating #54444.

Release note: None

55034: cli/flags: remove `doctor` from the list of timeout-supporting commands r=tbg a=knz

Fixes #54931.

This was unintentionally added - doctor is not meant to support
configurable timeouts (just yet).

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>

tbg commented Sep 30, 2020

I ran this on 8f768ad ~60 times today, and it never failed.


tbg commented Sep 30, 2020

All 100 runs passed on 485c196 (~master).


tbg commented Oct 1, 2020

Another 100 passed. Uhm?!


tbg commented Oct 1, 2020

And another 100. Feels like I'm messing something up here.


tbg commented Oct 6, 2020

Starting another 100 from a clean slate on 009e4b9, a SHA that we have definitely seen the problem on. If that fails to reproduce, I will try it with the specific version of roachtest/workload/etc from that night as well.


tbg commented Oct 6, 2020

^-- they all passed. Now running another 100 with roachtest from the same SHA.


tbg commented Oct 6, 2020

I realized that in a "cosmetic" refactor of the roachtest, I had changed where we set the replication factor, which explains why I wasn't getting a repro. I did get one with a roachtest SHA preceding that change.
I also looked at the code - we didn't change anything here. I will instrument this tomorrow to learn where things go sideways and will post a summary.


tbg commented Oct 9, 2020

Hmm. Everything looks sound. Going to start another repro cycle with vmodule=1 for allocator and replicate_queue.

@nvanbenschoten

Which line are we logging "could not select an appropriate replica to be removed" from? In the replicateQueue? I don't see how we could log that error and then continue with a removal either. Are you sure we're performing the removal in the same pass through the allocator?


tbg commented Oct 13, 2020

Weird. Ok, I think I'm chasing down the wrong alley here. I was assuming the replicate queue overreplication path is doing this demotion that wrecks the group. But it's not! I have verbose logging on and am not seeing this message

log.VEventf(ctx, 1, "removing replica %+v due to over-replication: %s",

for that demotion (but am seeing them for all the earlier expected ones).

Someone else is deciding that a replica removal needs to happen.


tbg commented Oct 13, 2020

I started logging a stack trace in execChangeReplicasCmd and am seeing that it's coming in via an AdminChangeReplicas command.

I also did a double take and am seeing the store rebalancer do something:

tobias-1602581821-05-n9cpu1-0003> I201013 09:48:29.541848 107 kv/kvserver/store_rebalancer.go:284 ⋮ [n3,s3,store-rebalancer] ran out of leases worth transferring and qps (174.30) is still above desired threshold (149.70); considering load-based replica rebalances

That basically must be it. Ok. I have no idea why I had such a hard time reproducing this; I don't think we tampered at all with the store rebalancer knobs, and I don't think we changed anything about how we measure load. Hmm.
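
The decision in that log line boils down to something like this (an illustrative sketch; the names and shape are made up, not the real store_rebalancer.go code):

```
package main

import "fmt"

// shouldTryReplicaRebalance sketches the decision in that log line: once
// there are no leases left worth transferring and the store's QPS is still
// above the target, fall back to load-based replica rebalances.
func shouldTryReplicaRebalance(leasesWorthTransferring int, storeQPS, targetQPS float64) bool {
	return leasesWorthTransferring == 0 && storeQPS > targetQPS
}

func main() {
	fmt.Println(shouldTryReplicaRebalance(0, 174.30, 149.70)) // true, as on n3 above
}
```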


tbg commented Oct 13, 2020

Yep, it's the store rebalancer. It's trying to rebalance, but since the range is over-replicated, it ends up removing a random live replica.

kv/kvserver/store_rebalancer.go:309 ⋮ [n5,s5,store-rebalancer] rebalancing r6 (81.65 qps) from ‹(n4,s4):4,(n5,s5):13,(n3,s3):3,(n7,s7):8,(n8,s8):9,(n9,s9):10,(n6,s6):11› to ‹[n4,s4 n3,s3 n7,s7 n8,s8 n9,s9 n6,s6]› to better balance load


tbg commented Oct 13, 2020

The "culprit" is that in chooseReplicaToRebalance, we don't "just" swap out a replica for another (though that would brick the cluster just the same), but really do an ad-hoc replicate queue like run that takes into account the desired target replication factor:

desiredReplicas := GetNeededReplicas(*zone.NumReplicas, clusterNodes)

I don't know what an easy fix here is. If I add code that says "if you'd change the replication factor, just leave it alone", I think we will still see failures: so far the problem has been the removal of a live replica, but a true swap onto a dead node would brick the group just the same. Really we want all of the safety hatches to apply to any attempt to make a replication change, but since the store rebalancer is essentially "its own thing", this is not easy.
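
For concreteness, that "leave the replication factor alone" guard would look roughly like this (hypothetical code, not the actual store rebalancer; and as noted, it would not catch a same-size swap onto a dead store):

```
package main

import "fmt"

// countsUnchanged sketches the guard: a load-based rebalance may swap which
// stores hold the replicas, but it must not change how many voters (or
// non-voters) the range ends up with.
func countsUnchanged(curVoters, targetVoters, curNonVoters, targetNonVoters []int) bool {
	return len(curVoters) == len(targetVoters) &&
		len(curNonVoters) == len(targetNonVoters)
}

func main() {
	// The rebalance from the log above: seven current replicas, six targets.
	cur := []int{4, 5, 3, 7, 8, 9, 6}
	target := []int{4, 3, 7, 8, 9, 6}
	fmt.Println(countsUnchanged(cur, target, nil, nil)) // false: don't do it
}
```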

tbg added a commit to tbg/cockroach that referenced this issue Nov 25, 2020
This used to live in the replicate queue, but there are other
entry points to replication changes, notably the store rebalancer
which caused cockroachdb#54444.

Move the check into the guts of replication changes, where it is
guaranteed to be invoked.

Fixes cockroachdb#50729
Touches cockroachdb#54444 (release-20.2)

Release note (bug fix): in rare situations, an automated replication
change could result in a loss of quorum. This would require down nodes
and a simultaneous change in the replication factor. Note that a change
in the replication factor can occur automatically if the cluster is
comprised of less than five available nodes. Experimentally the likelihood
of encountering this issue, even under contrived conditions, was small.
tbg added a commit to tbg/cockroach that referenced this issue Dec 3, 2020
This used to live in the replicate queue, but there are other
entry points to replication changes, notably the store rebalancer
which caused cockroachdb#54444.

Move the check into the guts of replication changes, where it is
guaranteed to be invoked.

Fixes cockroachdb#50729
Touches cockroachdb#54444 (release-20.2)

Release note (bug fix): in rare situations, an automated replication
change could result in a loss of quorum. This would require down nodes
and a simultaneous change in the replication factor. Note that a change
in the replication factor can occur automatically if the cluster is
comprised of less than five available nodes. Experimentally the likelihood
of encountering this issue, even under contrived conditions, was small.
craig bot pushed a commit that referenced this issue Dec 3, 2020
56735: kvserverpb: move quorum safeguard into execChangeReplicasTxn r=aayushshah15 a=tbg

This used to live in the replicate queue, but there are other
entry points to replication changes, notably the store rebalancer
which caused #54444.

Move the check into the guts of replication changes, where it is
guaranteed to be invoked.

Fixes #50729
Touches #54444 (release-20.2)

@aayushshah15 only requesting your review since you're in the area.
Feel free to opt out.

Release note (bug fix): in rare situations, an automated replication
change could result in a loss of quorum. This would require down nodes
and a simultaneous change in the replication factor. Note that a change
in the replication factor can occur automatically if the cluster is
comprised of less than five available nodes. Experimentally the likelihood
of encountering this issue, even under contrived conditions, was small.


Co-authored-by: Tobias Grieger <[email protected]>
@irfansharif

@tbg: Did we mean to backport #56735 to 20.2?


tbg commented Feb 3, 2021

I was entertaining it at the time, but now that that PR is already paged out, I think the risk-reward ratio isn't where it needs to be (though the problem is pretty bad if you do hit it... but it seems that people don't, in practice). I tried the backport, and it looks like there are a number of conflicts that would need manual resolution. Possibly they're mostly in the tests, but still, I'm apprehensive.

@irfansharif

I'll just close this issue then (was just grooming the backlog as triage oncall).

tbg added a commit to tbg/cockroach that referenced this issue May 4, 2021
When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: cockroachdb#54444 (comment)

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 4, 2021
When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: cockroachdb#54444 (comment)

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue May 20, 2021
When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: cockroachdb#54444 (comment)

Release note: None
tbg added a commit that referenced this issue May 20, 2021
When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: #54444 (comment)

Release note: None
craig bot pushed a commit that referenced this issue May 20, 2021
64650: kvserver: prevent StoreRebalancer from downreplicating r=erikgrinaker,nvanbenschoten a=tbg

When the replication factor is lowered and the StoreRebalancer
attempts a rebalance, it will accidentally perform a downreplication.
Since it wasn't ever supposed to do that, the downreplication is
pretty haphazard and doesn't safeguard quorum in the same way
that a "proper" downreplication likely would. Prevent if from
changing the number of voters and non-voters to avoid this issue.

Annoyingly, I [knew] about this problem, but instead of fixing it at the
source - as this commit does - I added a lower- level check that could
then not be backported to release-20.2, where we are now seeing this
problem.

[knew]: #54444 (comment)

#64649

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>