
storage: RelocateRange doesn't handle edge cases properly #29130

Closed
benesch opened this issue Aug 27, 2018 · 4 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
@benesch
Contributor

benesch commented Aug 27, 2018

// TODO(benesch): RelocateRange needs to be made more robust. It cannot
// currently handle certain edge cases, like multiple stores on one node. It
// also adds all new replicas before removing any old replicas, rather than
// performing interleaved adds/removes, resulting in a moment where the
// number of replicas is potentially double the configured replication
// factor.

// This is best-effort; if replication queues are enabled and a change in
// membership happens at the same time, there will be errors.

@benesch benesch self-assigned this Aug 27, 2018
@benesch benesch added the A-kv-distribution Relating to rebalancing and leasing. label Aug 27, 2018
@a-robinson
Contributor

I'm trying to come up with scenarios where adding all the replicas at once and doing all the removals afterwards causes problems, in order to justify the extra computation cost of doing them one by one. We can do them one by one, but it'll involve a bunch more computation to pick the order.

It takes a bit of a stretch to come up with problem scenarios, though. I can't think of a particularly common configuration in which having potentially double the configured replication factor makes a range more susceptible to data unavailability or loss than either the start or end state of the RelocateRange command. In general, adding replicas only makes a range more susceptible to unavailability if they're poorly spread across localities such that the resulting range is overrepresented in a locality. However, for that to be possible, it's almost necessary for the range to be overrepresented in that locality either before or after the change. I believe a case like [2] is the main exception:

  1. The cluster loses its ability to down-replicate after doing all its up-replicating. Perhaps a partition happens, or the meta ranges or rangelog table become unavailable. This could leave the cluster in an awkward, largely untested state with twice as many replicas as desired for a range for a long period of time. This doesn't directly break anything, but increases the chance of hitting an edge case.
  2. Consider a range with 5 replicas: 2 in zone=a (nodes a1 and a2), 2 in zone=b (nodes b1 and b2), 1 in zone=c (node c1). RelocateRange decides to relocate just the replicas in zone a (to nodes a3 and a4, also in zone a), such that after all the adds we have replicas on nodes (a1, a2, a3, a4, b1, b2, c1). In this scenario, an outage in zone a will take the range down, even though neither the before nor the after state is susceptible to such an outage. I don't believe there's an equivalent 3-node scenario -- you could argue that a rebalance from (a1,b1,c1) to (a2,b1,c1) is susceptible to losing zone a, but rebalancing one replica at a time doesn't avoid that (stability: Rebalances must be atomic #12768).
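To make the arithmetic in scenario 2 concrete, here's a minimal Go sketch; the zone names and replica sets are the hypothetical ones from the scenario above, not real cockroach APIs. It checks whether losing every replica in one zone costs the range its quorum:

```go
package main

import "fmt"

// quorumLostOnZoneOutage reports whether losing every replica in failedZone
// leaves the range with fewer than a majority of its replicas.
func quorumLostOnZoneOutage(replicaZones []string, failedZone string) bool {
	surviving := 0
	for _, z := range replicaZones {
		if z != failedZone {
			surviving++
		}
	}
	quorum := len(replicaZones)/2 + 1
	return surviving < quorum
}

func main() {
	before := []string{"a", "a", "b", "b", "c"}           // (a1, a2, b1, b2, c1)
	during := []string{"a", "a", "a", "a", "b", "b", "c"} // all adds done, no removes yet
	after := []string{"a", "a", "b", "b", "c"}            // (a3, a4, b1, b2, c1)

	fmt.Println(quorumLostOnZoneOutage(before, "a")) // false: 3 of 5 survive, quorum is 3
	fmt.Println(quorumLostOnZoneOutage(during, "a")) // true: only 3 of 7 survive, quorum is 4
	fmt.Println(quorumLostOnZoneOutage(after, "a"))  // false: 3 of 5 survive
}
```

Only the intermediate 7-replica state is vulnerable, which is exactly why interleaving the adds and removes (never exceeding 6 replicas here) matters.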

Given that scenario 2 is effectively equivalent to #12768 but for NumReplicas=5 instead of NumReplicas=3, it's not the worst risk in the world, but our vulnerability there is bad enough that I don't think we should want to open it up even more. Have any thoughts on this, @benesch?

The basic implementation approach here will be to add one new replica, then use allocator.RemoveTarget with a restricted list of removal candidates to pick an appropriate replica to remove, then add the next, and so on. RemoveTarget will take things like diversity into account such that we do the removals in a reasonable order. We can optionally also run the adds through AllocateTarget in case adding them in a certain order is better than another for diversity reasons. It could also keep RelocateRange from making changes that violate zone constraints, which could be seen as both a good and a bad thing.
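A rough sketch of that interleaved loop, using simplified stand-in types rather than the real allocator and AdminChangeReplicas APIs (removeTarget here just picks the first candidate, where the real allocator.RemoveTarget would weigh diversity):

```go
package main

import "fmt"

// ReplicaTarget identifies a replica placement. This is a simplified
// stand-in, not the real cockroach type.
type ReplicaTarget struct {
	NodeID int
}

func contains(set []ReplicaTarget, t ReplicaTarget) bool {
	for _, s := range set {
		if s == t {
			return true
		}
	}
	return false
}

// removeTarget stands in for allocator.RemoveTarget with a restricted
// candidate list; a real implementation would pick based on diversity.
func removeTarget(candidates []ReplicaTarget) ReplicaTarget {
	return candidates[0]
}

// relocate interleaves adds and removes: add one missing target, then
// remove one replica that isn't in targets, so membership never exceeds
// len(targets)+1 at any point.
func relocate(current, targets []ReplicaTarget) []ReplicaTarget {
	for {
		var toAdd, toRemove []ReplicaTarget
		for _, t := range targets {
			if !contains(current, t) {
				toAdd = append(toAdd, t)
			}
		}
		for _, c := range current {
			if !contains(targets, c) {
				toRemove = append(toRemove, c)
			}
		}
		if len(toAdd) == 0 && len(toRemove) == 0 {
			return current
		}
		if len(toAdd) > 0 {
			// In reality: issue an ADD_REPLICA change for one target.
			current = append(current, toAdd[0])
		}
		if len(toRemove) > 0 {
			// In reality: issue a REMOVE_REPLICA change for the chosen victim.
			victim := removeTarget(toRemove)
			next := current[:0]
			for _, c := range current {
				if c != victim {
					next = append(next, c)
				}
			}
			current = next
		}
	}
}

func main() {
	cur := []ReplicaTarget{{1}, {2}, {3}}
	tgt := []ReplicaTarget{{2}, {3}, {4}}
	fmt.Println(relocate(cur, tgt)) // [{2} {3} {4}]
}
```

The key property is that each iteration does at most one add followed by one remove, so the range only ever carries one extra replica at a time.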


I care about this more now than before due to #28852. I could do the one-by-one work there, but RelocateRange is quite a natural API for what it wants to do, so it makes a good opportunity to clean this issue up along the way. I don't plan on guaranteeing proper handling of multiple stores per node, but I don't think it'd be too hard to add later.

@benesch
Contributor Author

benesch commented Sep 6, 2018

Wait, are you offering to take this off my plate as part of #28852? 😍

Given that scenario 2 is effectively equivalent to #12768 but for NumReplicas=5 instead of NumReplicas=3, it's not the worst risk in the world, but our vulnerability there is bad enough that I don't think we should want to open it up even more. Have any thoughts on this, @benesch?

I do think it's slightly less bad than #12768—presumably the chance of a permanent data center failure is way lower than the chance of a permanent node failure. But definitely still unfortunate and worth protecting against.

It could also keep RelocateRange from making changes that violate zone constraints, which could be seen as both a good and a bad thing.

If you do this—and it should work just fine for the merge queue, because mergeable ranges must be in the same zone—I think you'll need an escape hatch to continue supporting EXPERIMENTAL_RELOCATE.

Everything else you mentioned sounds spot-on to my, admittedly less-expert, ears.

@a-robinson
Contributor

If you do this—and it should work just fine for the merge queue, because mergeable ranges must be in the same zone—I think you'll need an escape hatch to continue supporting EXPERIMENTAL_RELOCATE.

So you think it's worth allowing manual placement to violate constraints? There may be the occasional situation where someone wants that, but most of the time I'd think that requiring constraints to be obeyed would be desirable for preventing mistakes.

@benesch
Contributor Author

benesch commented Sep 6, 2018

Hmm, maybe you're right. I was thinking tests might need total control over replica placement... but I can't actually think of a reason that any of those tests would need to violate constraints.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Sep 6, 2018
Does a more gradual process of adding one replica then removing one
until all the desired changes have been made, using the existing
allocator code to do so in a reasonable order.

Fixes cockroachdb#29130

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this issue Sep 14, 2018
And convert it to an Admin command, AdminRelocateRange, as was required
to both let RelocateRange access the local store's resources while still
making it callable from the sql package.

It now does a more gradual process of adding one replica then removing
one until all the desired changes have been made, using the existing
allocator code to do so in a reasonable order.

Fixes cockroachdb#29130

Release note (sql change): The EXPERIMENTAL_RELOCATE statement no longer
temporarily increases the number of replicas in a range more than one
above the range's replication factor, preventing rare edge cases of
unavailability.
a-robinson added commits to a-robinson/cockroach that referenced this issue Sep 14 and Sep 19, 2018 (same commit message as above)
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 20, 2018
@tbg tbg added this to the 2.1 milestone Sep 20, 2018
craig bot pushed a commit that referenced this issue Sep 20, 2018
29684: storage: Avoid adding all replicas at once in RelocateRange r=a-robinson a=a-robinson

Does a more gradual process of adding one replica then removing one
until all the desired changes have been made, using the existing
allocator code to do so in a reasonable order.

Fixes #29130

Release note: None

-------------------

This is still a work-in-progress in the sense that it isn't really usable from the SQL code yet, since I just implemented it as a method on `Store`. If the approach looks reasonable, though, it can be converted into an Admin API method.

Co-authored-by: Alex Robinson <[email protected]>
@craig craig bot closed this as completed in #29684 Sep 20, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Sep 20, 2018 (same commit message as above)
