
storage: rule solver post merge issues #10275

Closed
15 of 19 tasks
BramGruneir opened this issue Oct 27, 2016 · 34 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
Milestone

Comments

@BramGruneir
Member

BramGruneir commented Oct 27, 2016

During the re-introduction of the rule solver, a number of issues popped up. This issue will track them.

See #10252, #10507

@BramGruneir BramGruneir self-assigned this Oct 27, 2016
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Oct 27, 2016
This adds back in 3 commits that were removed to facilitate the merge of develop
back to master. One other commit is no longer required.

Follow up fixes are tracked in cockroachdb#10275.

Closes cockroachdb#9336

1) 4446345
storage: add constraint rule solver for allocation

Rules are represented as a single function that returns the candidacy of the
store as well as a float value representing the score. These scores are then
aggregated across all rules and the stores are returned sorted by them.

Current rules:
- ruleReplicasUniqueNodes ensures that no two replicas are put on the same node.
- ruleConstraints enforces that required and prohibited constraints are
  followed, and that stores with more positive constraints are ranked higher.
- ruleDiversity ensures that nodes that have the fewest locality tiers in common
  are given higher priority.
- ruleCapacity prioritizes placing data on empty nodes when the choice is
  available and prevents data from going onto mostly full nodes.

2) dd3229a
storage: implemented RuleSolver into allocator

3) 27353a8
storage: removed unused rangeCountBalancer

There was a 4th commit that is no longer required. The simulation was already
converging since adding a rebalance threshold.
4e29a36
storage/simulation: only rebalance 50% of ranges on each iteration so it will
converge
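
For context, the rule shape described in the commit message above might look roughly like the following Go sketch; storeInfo and ruleCapacitySketch are illustrative stand-ins, not the actual code.

    // storeInfo is an illustrative stand-in for the real store descriptor.
    type storeInfo struct {
        rangeCount int
        capacity   int // rough measure of how many ranges the store can hold
    }

    // A rule reports whether a store remains a valid candidate and a score
    // contribution; scores from all rules are aggregated per store.
    type rule func(s storeInfo) (valid bool, score float64)

    // ruleCapacitySketch is in the spirit of ruleCapacity: it rules out mostly
    // full stores and favors emptier ones.
    func ruleCapacitySketch(s storeInfo) (bool, float64) {
        if s.capacity > 0 && float64(s.rangeCount)/float64(s.capacity) > 0.95 {
            return false, 0 // mostly full: not a valid candidate
        }
        return true, 1 / float64(1+s.rangeCount) // emptier stores score higher
    }
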
@petermattis
Collaborator

Add in the ability for a replica to realize that it is on an incorrect replica.

What does it mean for a replica to be on an "incorrect replica"?

@petermattis
Collaborator

Having a configurable list of rules is overkill.

I'd add that the rule structure can be inflexible. For example, deciding to expand the list of candidates to include stores which don't match a positive constraint is awkward to fit into such an approach.

@d4l3k
Contributor

d4l3k commented Oct 27, 2016

Doesn't the rule solver already take into account stores that don't match positive constraints? Stores with more positive constraints will have a higher score than stores with fewer matching constraints.

@petermattis
Collaborator

That's a good point. I guess we'd need to understand how hard we try to match the positive constraints. Randomness can throw a wrench in there. For example, if you have a small subset of nodes which match the positive constraint, randomness at the candidate level would make selection of those nodes unlikely.


@BramGruneir
Member Author

I think this can be solved easily using two scores instead of one.
The first is the constraint score; the second is a heuristic load-balancing score.
Find all stores that have the highest constraint score, randomly choose two from this pool, and pick the one with the highest balancing score.

This would be a lot cleaner than the current solution, which uses a weight of 0.01 for the rebalancing heuristic and 1.00 for constraints. And of course, any rule can invalidate a store entirely.

    {
        weight: 1.0,
        run:    ruleReplicasUniqueNodes,
    },
    {
        weight: 1.0,
        run:    ruleConstraints,
    },
    {
        weight: 0.01,
        run:    ruleCapacity,
    },
    {
        weight: 0.1,
        run:    ruleDiversity,
    },
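
For reference, the aggregation implied by those weights is roughly the following; this is a sketch only, and weightedRule, scoreStore, and storeInfo are illustrative names rather than the actual API.

    type storeInfo struct{ rangeCount, capacity int }

    type weightedRule struct {
        weight float64
        run    func(s storeInfo) (valid bool, score float64)
    }

    // scoreStore sums weight*score across all rules, or reports the store
    // invalid if any single rule rejects it outright.
    func scoreStore(rules []weightedRule, s storeInfo) (valid bool, total float64) {
        for _, wr := range rules {
            ok, score := wr.run(s)
            if !ok {
                return false, 0 // any rule can invalidate a store entirely
            }
            total += wr.weight * score
        }
        return true, total
    }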

@petermattis
Collaborator

Find all stores that have the highest constraint score and randomly choose two from this pool and pick the one with the highest balancing score.

What does it mean to "find all stores that have the highest constraint score"? I would have imagined the constraint rule to filter the candidates, not apply a score. If a store does not match a required or positive constraint, we definitely want to exclude it from the list of candidates, right?

@BramGruneir
Member Author

We have different types of constraints:
positive, required, and negative. Required and negative constraints rule out stores, while positive constraints aim for diversity.

see: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/expressive_zone_config.md

When a required or negative constraint is violated, the rule returns false and the store is ruled out. For positive constraints, the rule returns true regardless of whether the store matches, and instead increases the score if it does. So if we split our rules into heuristic and constraint rules, we can quickly find all the stores that have the highest constraint score and then randomly choose from those based on their heuristic scores. (n.b. heuristic rules can still rule out a store if it is overfilled.)
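
A minimal sketch of that constraint handling, assuming a store's attributes and locality values are available as a simple set; the type and field names are illustrative, not the actual ones.

    type constraintType int

    const (
        positive constraintType = iota // prefer matching stores; adds to the score
        required                       // store must match or it is ruled out
        negative                       // store must not match or it is ruled out
    )

    type constraint struct {
        typ   constraintType
        value string // e.g. an attribute or locality value such as "ssd"
    }

    // ruleConstraintsSketch rules out stores that violate required or negative
    // constraints and scores the rest by how many positive constraints match.
    func ruleConstraintsSketch(storeAttrs map[string]bool, cs []constraint) (valid bool, score float64) {
        for _, c := range cs {
            matches := storeAttrs[c.value]
            switch c.typ {
            case required:
                if !matches {
                    return false, 0
                }
            case negative:
                if matches {
                    return false, 0
                }
            case positive:
                if matches {
                    score++
                }
            }
        }
        return true, score
    }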

@petermattis
Collaborator

So if we split our rules into heuristic and constraint rules, we can quickly find all the stores that have the highest constraint score and then randomly choose from those based on their heuristic scores.

"highest constraint score"?

The part that isn't clear to me is where you set the threshold for what is a "highest constraint score" and what is not. Put another way, I'm not clear on what this would look like in code or pseudo-code. Can you elaborate?

BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Nov 7, 2016
Instead of applying
cockroachdb@1ef40f3
or cockroachdb#10252, this finishes the reapplication of the rule solver. However, this
also puts the rule solver behind the environment flag
COCKROACH_ENABLE_RULE_SOLVER for ease of testing; it defaults to disabled.

The follow-up to this commit is cockroachdb#10275 and a lot of testing to ensure that the
rule solver does indeed perform as expected.

Closes cockroachdb#9336
@jseldess
Contributor

jseldess commented Nov 7, 2016

What happens if you define locality tiers inconsistently? The RFC says:

If the order of tags on one node doesn't match those on others, we would warn the user. In such a case, we will use the order that the majority of nodes have.

Is that true?

Also, what do we do to inform the user that a constraint does not match any existing locality/attribute? Is it possible to validate constraints when a zone config is set? Users could add constraints and then update localities/attributes, so I imagine this would just be a warning, but in cases where there's a real mistake, this could really help.

@d4l3k
Contributor

d4l3k commented Nov 7, 2016

That statement is true in the current code.

The original idea was to have it log to console as well as display errors as a banner in the UI. There's a PR sitting around somewhere to do that.
As for matching good constraints, I wrote a tool as part of this for checking the effect of the constraints.

#8887

BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Nov 9, 2016
Instead of applying 1ef40f3
or cockroachdb#10252, this finishes the reapplication of the rule solver. However, this
also puts the rule solver behind the environment flag
COCKROACH_ENABLE_RULE_SOLVER for ease of testing; it defaults to disabled.

This commit re-applies the rule solver, specifically the following commits:

1) 4446345
storage: add constraint rule solver for allocation

Rules are represented as a single function that returns the candidacy of the
store as well as a float value representing the score. These scores are then
aggregated across all rules and the stores are returned sorted by them.

Current rules:
- ruleReplicasUniqueNodes ensures that no two replicas are put on the same node.
- ruleConstraints enforces that required and prohibited constraints are
  followed, and that stores with more positive constraints are ranked higher.
- ruleDiversity ensures that nodes that have the fewest locality tiers in common
  are given higher priority.
- ruleCapacity prioritizes placing data on empty nodes when the choice is
  available and prevents data from going onto mostly full nodes.

2) dd3229a
storage: implemented RuleSolver into allocator

The follow-up to this commit is cockroachdb#10275 and a lot of testing to ensure that the
rule solver does indeed perform as expected.

Closes cockroachdb#9336
@BramGruneir
Member Author

@petermattis

What does it mean for a replica to be on an "incorrect replica"?

Cleaned this up. It was worded incorrectly.

"highest constraint score"?
The part that isn't clear to me is where you set the threshold for what is a "highest constraint score" and what is not. Put another way, I'm not clear on what this would look like in code or pseudo-code. Can you elaborate?

Sure. Here's a more complete example:

Rules should return valid bool, constraintScore int, balancingScore int. (I think I prefer ints for the scores, but I'm not 100% on that yet.)

And each rule would be able to return any combination of all three.

valid - is for when there are hard constraints, such as when locality=usa is required.
constraintScore - is for softer constraints that are always vital. This is where diversity and positive constraints add to the score.
balancingScore - is where nice to have options fit. Such as scores with more fewer replicas have a higher score, etc.

The algorithm to determine where a new replica should go is straightforward:

Once the rules are run against the available stores and all non-valid ones are ruled out, find the collection of stores that have the highest constraint score. From that collection, pick two stores randomly and choose the one with the highest rebalance score. This introduces randomness back into the system and adheres to locality/attribute constraints as much as possible.

This would remove the need for weighting the scores and make it clear how each rule affects rebalancing decisions.
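
In rough Go, the selection described above might look like the following; the candidate fields and selectTarget are illustrative, not the actual API.

    import "math/rand"

    // candidate is an illustrative stand-in for whatever the rules produce.
    type candidate struct {
        storeID         int
        valid           bool
        constraintScore int
        balancingScore  int
    }

    // selectTarget drops invalid candidates, keeps those tied for the highest
    // constraint score, then picks two of them at random and returns the one
    // with the better balancing score.
    func selectTarget(cands []candidate, rng *rand.Rand) (candidate, bool) {
        var top []candidate
        best := 0
        for _, c := range cands {
            if !c.valid {
                continue
            }
            switch {
            case len(top) == 0 || c.constraintScore > best:
                best, top = c.constraintScore, []candidate{c}
            case c.constraintScore == best:
                top = append(top, c)
            }
        }
        if len(top) == 0 {
            return candidate{}, false
        }
        a, b := top[rng.Intn(len(top))], top[rng.Intn(len(top))]
        if b.balancingScore > a.balancingScore {
            return b, true
        }
        return a, true
    }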

BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Nov 10, 2016
Relying on the range over a map to ensure randomness is a little too subtle. This is much more explicit.

Part of cockroachdb#10275
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Nov 11, 2016
Also, add a shuffle to the store pool's getStoreList.

Relying on the range over a map to ensure randomness is a little too subtle. This is much more explicit.

Part of cockroachdb#10275
@BramGruneir
Member Author

I've been thinking a bit more about this, and for now I think the quickest thing to do is just to remove the scoring for free space entirely. That way we can randomize based on the top-scoring stores (if there is more than one), which would closely match the current system.

@petermattis
Collaborator

balancingScore - is where nice to have options fit. Such as scores with more fewer replicas have a higher score, etc.

more fewer replicas?

find the collection of stores that have the highest constraint score

I still don't understand how this would be done. What threshold would you use to separate "highest constraint score" from lower constraint scores?

@BramGruneir
Member Author

Wow, that sentence got mangled:
balancingScore - is where nice to have options fit. Such as stores with fewer replicas should have a higher score, etc.

But as my follow up comment mentioned, I don't think we need it right now.

Constraint scores come from rules that are purely additive.
There are really only 2 rules for this right now:

  • ConstraintRule - Each positive constraint (not required or negative ones, since those only rule out stores) would add 1 to the score for that store.
  • DiversityRule - Look at the most common locality for each level, and if the store is different from the most common, add 1 to the score for that store. (This rule needs to be redefined a little, but that's the general idea.)

So from the group of available stores, find the one(s) that share the highest score. If there's more than one, pick a random one.
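
A rough sketch of that DiversityRule, assuming each store's locality is an ordered slice of tier values; the representation is an assumption, not the actual one.

    // diversityScore adds 1 for each locality tier at which the candidate store
    // differs from the most common value among the existing stores.
    func diversityScore(store []string, existing [][]string) int {
        score := 0
        for tier := range store {
            counts := map[string]int{}
            mostCommon, best := "", 0
            for _, e := range existing {
                if tier >= len(e) {
                    continue
                }
                counts[e[tier]]++
                if counts[e[tier]] > best {
                    best, mostCommon = counts[e[tier]], e[tier]
                }
            }
            if store[tier] != mostCommon {
                score++
            }
        }
        return score
    }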

BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Dec 4, 2016
The first one ensures we never try to overfill a store while
the second generates a balance score based on how full the target
store is.

Part of cockroachdb#10275
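
A minimal sketch of the two rules described in this commit message; the storeCapacity fields and the 95% threshold are assumptions for illustration, not the actual values.

    type storeCapacity struct {
        capacity  int64 // total bytes
        available int64 // free bytes
    }

    // balanceScore favors emptier stores: close to 1.0 for an empty store,
    // approaching 0.0 as it fills up.
    func balanceScore(c storeCapacity) float64 {
        if c.capacity == 0 {
            return 0
        }
        return float64(c.available) / float64(c.capacity)
    }

    // maxCapacityCheck rules out stores that are already roughly 95% full.
    func maxCapacityCheck(c storeCapacity) bool {
        return balanceScore(c) > 0.05
    }
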
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Dec 8, 2016
- Splits the scores returned by the rule solver into separate constraint and balance scores.
- Adds a valid field to constraints and adds it to all rules.
- Solve now returns all candidates instead of just the valid ones. To get only the valid candidates, the new function onlyValid and the new type candidateList have also been added.
- This allows us to use solve for removeTarget. It also cleans up the logic in removeTarget to more closely match the non-rule solver version.
- Splits the capacity rules into two rules. They were performing two different operations and didn't make sense being combined. This will also ease the change of converting the rules to basic functions.

Part of cockroachdb#10275
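
The candidateList and onlyValid additions described here might look roughly like the following; the field names are illustrative.

    type candidate struct {
        valid           bool
        constraintScore float64
        balanceScore    float64
    }

    type candidateList []candidate

    // onlyValid returns only the candidates that passed every rule.
    func (cl candidateList) onlyValid() candidateList {
        var valid candidateList
        for _, c := range cl {
            if c.valid {
                valid = append(valid, c)
            }
        }
        return valid
    }
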
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Dec 13, 2016
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Dec 16, 2016
This gets rid of the weird 1/(1+rangecount) that was there before. If
more complex scoring is required, it can be added then.

Part of cockroachdb#10275.
@petermattis petermattis added this to the 1.0 milestone Feb 23, 2017
@dianasaur323
Contributor

dianasaur323 commented Apr 12, 2017

@BramGruneir how much of this issue is still relevant, or is it outdated at this point?

@BramGruneir
Member Author

Most of this issue is complete; what remains is some cleanup and new tools that aren't relevant for 1.0.

@BramGruneir BramGruneir modified the milestones: Later, 1.0 Apr 17, 2017
@petermattis
Collaborator

@BramGruneir Is there anything left to do here? Perhaps worthwhile to file new issues about remaining work and close this down.

@BramGruneir
Member Author

There are 5 issues left. I'd like to get to them at some point. I figured after my 1.1 tasks are completed.

But none of them are needed for 1.1. What would be the point of adding in a new issue instead of keeping this open?

@petermattis
Collaborator

But none of them are needed for 1.1. What would be the point of adding in a new issue instead of keeping this open?

Clarity. It can be overwhelming to read the above every time this issue is visited. Also, some of the tasks don't have associated issues. Fine to leave this open too.

@petermattis petermattis added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 21, 2018
@petermattis
Collaborator

@BramGruneir Is there anything left to be done here? If yes, please file new issues and close this one.

@BramGruneir
Member Author

This is all finished.
