
storage: RelocateRange doesn't handle edge cases properly #29130

Closed
benesch opened this issue Aug 27, 2018 · 4 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
@benesch
Contributor

benesch commented Aug 27, 2018

// TODO(benesch): RelocateRange needs to be made more robust. It cannot
// currently handle certain edge cases, like multiple stores on one node. It
// also adds all new replicas before removing any old replicas, rather than
// performing interleaved adds/removes, resulting in a moment where the
// number of replicas is potentially double the configured replication
// factor.

// This is best-effort; if replication queues are enabled and a change in
// membership happens at the same time, there will be errors.

@benesch benesch self-assigned this Aug 27, 2018
@benesch benesch added the A-kv-distribution Relating to rebalancing and leasing. label Aug 27, 2018
@a-robinson
Contributor

I'm trying to come up with scenarios where adding all the replicas at once and doing all the removals afterwards causes problems, in order to justify the extra computation cost of doing them one by one. We can do them one by one, but it'll involve a bunch more computation to pick the order.

It takes a bit of a stretch to come up with problem scenarios, though. I can't think of a particularly common configuration in which having potentially double the configured replication factor makes a range more susceptible to data unavailability or loss than either the start or end state of the RelocateRange command. In general, adding replicas only makes a range more susceptible to unavailability if they're poorly spread across localities such that the resulting range is overrepresented in a locality. However, for that to be possible, it's almost necessary for the range to be overrepresented in that locality either before or after the change. I believe a case like [2] is the main exception:

  1. The cluster loses its ability to down-replicate after doing all its up-replicating. Perhaps a partition happens, or the meta ranges or rangelog table become unavailable. This could leave the cluster in an awkward, largely untested state with twice as many replicas as desired for a range for a long period of time. This doesn't directly break anything, but increases the chance of hitting an edge case.
  2. Consider a range with 5 replicas: 2 in zone=a (nodes a1 and a2), 2 in zone=b (nodes b1 and b2), 1 in zone=c (node c1). RelocateRange decides to relocate just the replicas in zone a (to nodes a3 and a4, also in zone a), such that after all the adds we have replicas on nodes (a1, a2, a3, a4, b1, b2, c1). In this scenario, an outage in zone a will take the range down, even though neither the before nor the after state is susceptible to such an outage. I don't believe there's an equivalent 3-node scenario -- you could argue that a rebalance from (a1,b1,c1) to (a2,b1,c1) is susceptible to losing zone a, but rebalancing one replica at a time doesn't avoid that (stability: Rebalances must be atomic #12768).
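To make the arithmetic in scenario 2 concrete, here's a minimal Go sketch; the zone names and replica sets are the hypothetical ones from the scenario above, not real cockroach APIs. It checks whether losing every replica in one zone costs the range its quorum:

```go
package main

import "fmt"

// quorumLostOnZoneOutage reports whether losing every replica in failedZone
// leaves the range with fewer than a majority of its replicas.
func quorumLostOnZoneOutage(replicaZones []string, failedZone string) bool {
	surviving := 0
	for _, z := range replicaZones {
		if z != failedZone {
			surviving++
		}
	}
	quorum := len(replicaZones)/2 + 1
	return surviving < quorum
}

func main() {
	before := []string{"a", "a", "b", "b", "c"}           // (a1, a2, b1, b2, c1)
	during := []string{"a", "a", "a", "a", "b", "b", "c"} // all adds done, no removes yet
	after := []string{"a", "a", "b", "b", "c"}            // (a3, a4, b1, b2, c1)

	fmt.Println(quorumLostOnZoneOutage(before, "a")) // false: 3 of 5 survive, quorum is 3
	fmt.Println(quorumLostOnZoneOutage(during, "a")) // true: only 3 of 7 survive, quorum is 4
	fmt.Println(quorumLostOnZoneOutage(after, "a"))  // false: 3 of 5 survive
}
```

Only the intermediate 7-replica state is vulnerable, which is exactly why interleaving the adds and removes (never exceeding 6 replicas here) matters.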

Given that scenario 2 is effectively equivalent to #12768 but for NumReplicas=5 instead of NumReplicas=3, it's not the worst risk in the world, but our vulnerability there is bad enough that I don't think we should want to open it up even more. Have any thoughts on this, @benesch?

The basic implementation approach here will be to add one new replica, then use allocator.RemoveTarget with a restricted list of removal candidates to pick an appropriate replica to remove, then add the next, and so on. RemoveTarget will take things like diversity into account such that we do the removals in a reasonable order. We can optionally also run the adds through AllocateTarget in case adding them in a certain order is better than another for diversity reasons. It could also keep RelocateRange from making changes that violate zone constraints, which could be seen as both a good and a bad thing.
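A rough sketch of that interleaved loop, using simplified stand-in types rather than the real allocator and AdminChangeReplicas APIs (removeTarget here just picks the first candidate, where the real allocator.RemoveTarget would weigh diversity):

```go
package main

import "fmt"

// ReplicaTarget identifies a replica placement. This is a simplified
// stand-in, not the real cockroach type.
type ReplicaTarget struct {
	NodeID int
}

func contains(set []ReplicaTarget, t ReplicaTarget) bool {
	for _, s := range set {
		if s == t {
			return true
		}
	}
	return false
}

// removeTarget stands in for allocator.RemoveTarget with a restricted
// candidate list; a real implementation would pick based on diversity.
func removeTarget(candidates []ReplicaTarget) ReplicaTarget {
	return candidates[0]
}

// relocate interleaves adds and removes: add one missing target, then
// remove one replica that isn't in targets, so membership never exceeds
// len(targets)+1 at any point.
func relocate(current, targets []ReplicaTarget) []ReplicaTarget {
	for {
		var toAdd, toRemove []ReplicaTarget
		for _, t := range targets {
			if !contains(current, t) {
				toAdd = append(toAdd, t)
			}
		}
		for _, c := range current {
			if !contains(targets, c) {
				toRemove = append(toRemove, c)
			}
		}
		if len(toAdd) == 0 && len(toRemove) == 0 {
			return current
		}
		if len(toAdd) > 0 {
			// In reality: issue an ADD_REPLICA change for one target.
			current = append(current, toAdd[0])
		}
		if len(toRemove) > 0 {
			// In reality: issue a REMOVE_REPLICA change for the chosen victim.
			victim := removeTarget(toRemove)
			next := current[:0]
			for _, c := range current {
				if c != victim {
					next = append(next, c)
				}
			}
			current = next
		}
	}
}

func main() {
	cur := []ReplicaTarget{{1}, {2}, {3}}
	tgt := []ReplicaTarget{{2}, {3}, {4}}
	fmt.Println(relocate(cur, tgt)) // [{2} {3} {4}]
}
```

The key property is that each iteration does at most one add followed by one remove, so the range only ever carries one extra replica at a time.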


I care about this more now than before due to #28852. I could do the one-by-one work there, but RelocateRange is quite a natural API for what it wants to do, so it makes a good opportunity to clean this issue up along the way. I don't plan on guaranteeing proper handling of multiple stores per node, but I don't think it'd be too hard to add later.

@benesch
Contributor Author

benesch commented Sep 6, 2018

Wait, are you offering to take this off my plate as part of #28852? 😍

Given that scenario 2 is effectively equivalent to #12768 but for NumReplicas=5 instead of NumReplicas=3, it's not the worst risk in the world, but our vulnerability there is bad enough that I don't think we should want to open it up even more. Have any thoughts on this, @benesch?

I do think it's slightly less bad than #12768—presumably the chance of a permanent data center failure is way lower than the chance of a permanent node failure. But definitely still unfortunate and worth protecting against.

It could also keep RelocateRange from making changes that violate zone constraints, which could be seen as both a good and a bad thing.

If you do this—and it should work just fine for the merge queue, because mergeable ranges must be in the same zone—I think you'll need an escape hatch to continue supporting EXPERIMENTAL_RELOCATE.

Everything else you mentioned sounds spot-on to my, admittedly less-expert, ears.

@a-robinson
Contributor

If you do this—and it should work just fine for the merge queue, because mergeable ranges must be in the same zone—I think you'll need an escape hatch to continue supporting EXPERIMENTAL_RELOCATE.

So you think it's worth allowing manual placement to violate constraints? There may be the occasional situation where someone wants that, but most of the time I'd think that requiring constraints to be obeyed would be desirable for preventing mistakes.

@benesch
Contributor Author

benesch commented Sep 6, 2018

Hmm, maybe you're right. I was thinking tests might need total control over replica placement... but I can't actually think of a reason that any of those tests would need to violate constraints.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Sep 6, 2018
Does a more gradual process of adding one replica then removing one
until all the desired changes have been made, using the existing
allocator code to do so in a reasonable order.

Fixes cockroachdb#29130

Release note: None
a-robinson added a commit to a-robinson/cockroach that referenced this issue Sep 14, 2018
And convert it to an Admin command, AdminRelocateRange, as was required
to both let RelocateRange access the local store's resources while still
making it callable from the sql package.

It now does a more gradual process of adding one replica then removing
one until all the desired changes have been made, using the existing
allocator code to do so in a reasonable order.

Fixes cockroachdb#29130

Release note (sql change): The EXPERIMENTAL_RELOCATE statement no longer
temporarily increases the number of replicas in a range more than one
above the range's replication factor, preventing rare edge cases of
unavailability.
a-robinson added commits to a-robinson/cockroach that referenced this issue Sep 14 and Sep 19, 2018 (same commit message as above)
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 20, 2018
@tbg tbg added this to the 2.1 milestone Sep 20, 2018
craig bot pushed a commit that referenced this issue Sep 20, 2018
29684: storage: Avoid adding all replicas at once in RelocateRange r=a-robinson a=a-robinson

Does a more gradual process of adding one replica then removing one
until all the desired changes have been made, using the existing
allocator code to do so in a reasonable order.

Fixes #29130

Release note: None

-------------------

This is still a work-in-progress in the sense that it isn't really usable from the SQL code yet, since I just implemented it as a method on `Store`. If the approach looks reasonable, though, it can be converted into an Admin API method.

Co-authored-by: Alex Robinson <[email protected]>
@craig craig bot closed this as completed in #29684 Sep 20, 2018
a-robinson added a commit to a-robinson/cockroach that referenced this issue Sep 20, 2018 (same commit message as above)
