
storage: rule solver post merge issues #10275

Closed
15 of 19 tasks
BramGruneir opened this issue Oct 27, 2016 · 34 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
Milestone

Comments

@BramGruneir
Member

BramGruneir commented Oct 27, 2016

During the re-introduction of the rule solver, a number of issues popped up. This issue will track them.

See #10252, #10507

@BramGruneir BramGruneir self-assigned this Oct 27, 2016
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Oct 27, 2016
This adds back in 3 commits that were removed to facilitate the merge of develop
back to master. One other commit is no longer required.

Follow up fixes are tracked in cockroachdb#10275.

Closes cockroachdb#9336

1) 4446345
storage: add constraint rule solver for allocation

Rules are represented as a single function that returns the candidacy of the
store as well as a float value representing the score. These scores are then
aggregated across all rules and the stores are returned sorted by them.

Current rules:
- ruleReplicasUniqueNodes ensures that no two replicas are put on the same node.
- ruleConstraints enforces that required and prohibited constraints are
  followed, and that stores with more positive constraints are ranked higher.
- ruleDiversity ensures that nodes that have the fewest locality tiers in common
  are given higher priority.
- ruleCapacity prioritizes placing data on empty nodes when the choice is
  available and prevents data from going onto mostly full nodes.

2) dd3229a
storage: implemented RuleSolver into allocator

3) 27353a8
storage: removed unused rangeCountBalancer

There was a 4th commit that is no longer required. The simulation was already
converging since adding a rebalance threshold.
4e29a36
storage/simulation: only rebalance 50% of ranges on each iteration so it will
converge
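
For context, the rule shape described in the commit message above might look roughly like the following Go sketch; storeInfo and ruleCapacitySketch are illustrative stand-ins, not the actual code.

    // storeInfo is an illustrative stand-in for the real store descriptor.
    type storeInfo struct {
        rangeCount int
        capacity   int // rough measure of how many ranges the store can hold
    }

    // A rule reports whether a store remains a valid candidate and a score
    // contribution; scores from all rules are aggregated per store.
    type rule func(s storeInfo) (valid bool, score float64)

    // ruleCapacitySketch is in the spirit of ruleCapacity: it rules out mostly
    // full stores and favors emptier ones.
    func ruleCapacitySketch(s storeInfo) (bool, float64) {
        if s.capacity > 0 && float64(s.rangeCount)/float64(s.capacity) > 0.95 {
            return false, 0 // mostly full: not a valid candidate
        }
        return true, 1 / float64(1+s.rangeCount) // emptier stores score higher
    }
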
@petermattis
Collaborator

Add in the ability for a replica to realize that it is on an incorrect replica.

What does it mean for a replica to be on an "incorrect replica"?

@petermattis
Collaborator

Having a configurable list of rules is overkill.

I'd add that the rule structure can be inflexible. For example, deciding to expand the list of candidates to include stores which don't match a positive constraint is awkward to fit into such an approach.

@d4l3k
Contributor

d4l3k commented Oct 27, 2016

Doesn't the rule solver already take into account stores that don't match positive constraints? Stores with more positive constraints will have a higher score than stores with fewer matching constraints.

@petermattis
Collaborator

That's a good point. I guess we'd need to understand how hard we try to match the positive constraints. Randomness can throw a wrench in there. For example, if you have a small subset of nodes which match the positive constraint, randomness at the candidate level would make selection of those nodes unlikely.


@BramGruneir
Member Author

I think this can be solved easily using two scores instead of one.
The first is the constraint score; the second is a heuristic load-balancing score.
Find all stores that have the highest constraint score, randomly choose two from this pool, and pick the one with the highest balancing score.

This would be a lot cleaner than the current solution, which uses a weight of 0.01 for the rebalancing heuristic and 1.00 for constraints. And of course, any rule can invalidate a store entirely.

    {
        weight: 1.0,
        run:    ruleReplicasUniqueNodes,
    },
    {
        weight: 1.0,
        run:    ruleConstraints,
    },
    {
        weight: 0.01,
        run:    ruleCapacity,
    },
    {
        weight: 0.1,
        run:    ruleDiversity,
    },
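
For reference, the aggregation implied by those weights is roughly the following; this is a sketch only, and weightedRule, scoreStore, and storeInfo are illustrative names rather than the actual API.

    type storeInfo struct{ rangeCount, capacity int }

    type weightedRule struct {
        weight float64
        run    func(s storeInfo) (valid bool, score float64)
    }

    // scoreStore sums weight*score across all rules, or reports the store
    // invalid if any single rule rejects it outright.
    func scoreStore(rules []weightedRule, s storeInfo) (valid bool, total float64) {
        for _, wr := range rules {
            ok, score := wr.run(s)
            if !ok {
                return false, 0 // any rule can invalidate a store entirely
            }
            total += wr.weight * score
        }
        return true, total
    }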

@petermattis
Collaborator

Find all stores that have the highest constraint score and randomly choose two from this pool and pick the one with the highest balancing score.

What does it mean to "find all stores that have the highest constraint score"? I would have imagined the constraint rule to filter the candidates, not apply a score. If a store does not match a required or positive constraint, we definitely want to exclude it from the list of candidates, right?

@BramGruneir
Member Author

We have different types of constraints:
positive, required, and negative. Required and negative constraints rule out stores, while positive constraints aim for diversity.

see: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/expressive_zone_config.md

When a required or negative constraint is violated, the rule returns false and the store is ruled out. For positive constraints, the rule returns true regardless of whether the store matches, and instead increases the score if it does. So if we split our rules into heuristic and constraint rules, we can quickly find all the stores that have the highest constraint score and then randomly choose from those based on their heuristic scores. (n.b. heuristic rules can still rule out a store if it is overfilled.)
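
A minimal sketch of that constraint handling, assuming a store's attributes and locality values are available as a simple set; the type and field names are illustrative, not the actual ones.

    type constraintType int

    const (
        positive constraintType = iota // prefer matching stores; adds to the score
        required                       // store must match or it is ruled out
        negative                       // store must not match or it is ruled out
    )

    type constraint struct {
        typ   constraintType
        value string // e.g. an attribute or locality value such as "ssd"
    }

    // ruleConstraintsSketch rules out stores that violate required or negative
    // constraints and scores the rest by how many positive constraints match.
    func ruleConstraintsSketch(storeAttrs map[string]bool, cs []constraint) (valid bool, score float64) {
        for _, c := range cs {
            matches := storeAttrs[c.value]
            switch c.typ {
            case required:
                if !matches {
                    return false, 0
                }
            case negative:
                if matches {
                    return false, 0
                }
            case positive:
                if matches {
                    score++
                }
            }
        }
        return true, score
    }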

@petermattis
Collaborator

So if we split our rules into heuristic and constraint rules, we can quickly find all the stores that have the highest constraint score and then randomly choose from those based on their heuristic scores.

"highest constraint score"?

The part that isn't clear to me is where you set the threshold for what is a "highest constraint score" and what is not. Put another way, I'm not clear on what this would look like in code or pseudo-code. Can you elaborate?

BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Nov 7, 2016
Instead of applying
cockroachdb@1ef40f3
or cockroachdb#10252, this finishes the reapplication of the rule solver. However, this
also puts the rule solver behind the environment flag
COCKROACH_ENABLE_RULE_SOLVER for ease of testing; it defaults to disabled.

The follow-up to this commit is cockroachdb#10275 and a lot of testing to ensure that the
rule solver does indeed perform as expected.

Closes cockroachdb#9336
@jseldess
Contributor

jseldess commented Nov 7, 2016

What happens if you define locality tiers inconsistently? The RFC says:

If the order of tags on one node doesn't match those on others, we would warn the user. In such a case, we will use the order that the majority of nodes have.

Is that true?

Also, what do we do to inform the user that a constraint does not match any existing locality/attribute? Is it possible to validate constraints when a zone config is set? Users could add constraints and then update localities/attributes, so I imagine this would just be a warning, but in cases where there's a real mistake, this could really help.

@d4l3k
Contributor

d4l3k commented Nov 7, 2016

That statement is true in the current code.

The original idea was to have it log to console as well as display errors as a banner in the UI. There's a PR sitting around somewhere to do that.
As for matching good constraints, I wrote a tool as part of this for checking the effect of the constraints.

#8887

BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Nov 9, 2016
Instead of applying 1ef40f3
or cockroachdb#10252, this finishes the reapplication of the rule solver. However, this
also puts the rule solver behind the environment flag
COCKROACH_ENABLE_RULE_SOLVER for ease of testing; it defaults to disabled.

This commit re-applies the rule solver, specifically the following commits:

1) 4446345
storage: add constraint rule solver for allocation

Rules are represented as a single function that returns the candidacy of the
store as well as a float value representing the score. These scores are then
aggregated across all rules and the stores are returned sorted by them.

Current rules:
- ruleReplicasUniqueNodes ensures that no two replicas are put on the same node.
- ruleConstraints enforces that required and prohibited constraints are
  followed, and that stores with more positive constraints are ranked higher.
- ruleDiversity ensures that nodes that have the fewest locality tiers in common
  are given higher priority.
- ruleCapacity prioritizes placing data on empty nodes when the choice is
  available and prevents data from going onto mostly full nodes.

2) dd3229a
storage: implemented RuleSolver into allocator

The follow-up to this commit is cockroachdb#10275 and a lot of testing to ensure that the
rule solver does indeed perform as expected.

Closes cockroachdb#9336
@BramGruneir
Member Author

@petermattis

What does it mean for a replica to be on an "incorrect replica"?

Cleaned this up. It was worded incorrectly.

"highest constraint score"?
The part that isn't clear to me is where you set the threshold for what is a "highest constraint score" and what is not. Put another way, I'm not clear on what this would look like in code or pseudo-code. Can you elaborate?

Sure. Here's a more complete example:

Rules should return valid bool, constraintScore int, balancingScore int. (I think I prefer ints for the scores, but I'm not 100% on that yet.)

And each rule would be able to return any combination of all three.

valid - is for when there are hard constraints, such as when locality=usa is required.
constraintScore - is for softer constraints that are always vital. This is where diversity and positive constraints add to the score.
balancingScore - is where nice to have options fit. Such as scores with more fewer replicas have a higher score, etc.

The algorithm to determine where a new replica should go is straightforward:

Once the rules are run against the available stores and all non-valid ones are ruled out, find the collection of stores that have the highest constraint score. From that collection, pick two stores randomly and choose the one with the highest rebalance score. This introduces randomness back into the system and adheres to locality/attribute constraints as much as possible.

This would remove the need for weighting the scores and make it clear how each rule affects rebalancing decisions.
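
In rough Go, the selection described above might look like the following; the candidate fields and selectTarget are illustrative, not the actual API.

    import "math/rand"

    // candidate is an illustrative stand-in for whatever the rules produce.
    type candidate struct {
        storeID         int
        valid           bool
        constraintScore int
        balancingScore  int
    }

    // selectTarget drops invalid candidates, keeps those tied for the highest
    // constraint score, then picks two of them at random and returns the one
    // with the better balancing score.
    func selectTarget(cands []candidate, rng *rand.Rand) (candidate, bool) {
        var top []candidate
        best := 0
        for _, c := range cands {
            if !c.valid {
                continue
            }
            switch {
            case len(top) == 0 || c.constraintScore > best:
                best, top = c.constraintScore, []candidate{c}
            case c.constraintScore == best:
                top = append(top, c)
            }
        }
        if len(top) == 0 {
            return candidate{}, false
        }
        a, b := top[rng.Intn(len(top))], top[rng.Intn(len(top))]
        if b.balancingScore > a.balancingScore {
            return b, true
        }
        return a, true
    }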

BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Nov 10, 2016
Relying on the range over a map to ensure randomness is a little too subtle. This is much more explicit.

Part of cockroachdb#10275
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Nov 11, 2016
Also, add a shuffle to the store pool's getStoreList.

Relying on the range over a map to ensure randomness is a little too subtle. This is much more explicit.

Part of cockroachdb#10275
@BramGruneir
Member Author

I've been thinking a bit more about this, and for now I think the quickest thing to do is just to remove the scoring for free space entirely. That way we can randomize based on the top-scoring stores (if there is more than one), which would closely match the current system.

@petermattis
Collaborator

balancingScore - is where nice to have options fit. Such as scores with more fewer replicas have a higher score, etc.

more fewer replicas?

find the collection of stores that have the highest constraint score

I still don't understand how this would be done. What threshold would you use to separate "highest constraint score" from lower constraint scores?

@BramGruneir
Member Author

Wow, that sentence got mangled:
balancingScore - is where nice to have options fit. Such as stores with fewer replicas should have a higher score, etc.

But as my follow up comment mentioned, I don't think we need it right now.

Constraint scores come from rules that are purely additive.
There are really only 2 rules for this right now:

  • ConstraintRule - Each positive constraint (not required or negative ones, since those only rule out stores) would add 1 to the score for that store.
  • DiversityRule - Look at the most common locality for each level, and if the store is different from the most common, add 1 to the score for that store. (This rule needs to be redefined a little, but that's the general idea.)

So from the group of available stores, find the one(s) that share the highest score. If there's more than one, pick a random one.
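
A rough sketch of that DiversityRule, assuming each store's locality is an ordered slice of tier values; the representation is an assumption, not the actual one.

    // diversityScore adds 1 for each locality tier at which the candidate store
    // differs from the most common value among the existing stores.
    func diversityScore(store []string, existing [][]string) int {
        score := 0
        for tier := range store {
            counts := map[string]int{}
            mostCommon, best := "", 0
            for _, e := range existing {
                if tier >= len(e) {
                    continue
                }
                counts[e[tier]]++
                if counts[e[tier]] > best {
                    best, mostCommon = counts[e[tier]], e[tier]
                }
            }
            if store[tier] != mostCommon {
                score++
            }
        }
        return score
    }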

BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Dec 4, 2016
The first one ensures we never try to overfill a store while
the second generates a balance score based on how full the target
store is.

Part of cockroachdb#10275
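
A minimal sketch of the two rules described in this commit message; the storeCapacity fields and the 95% threshold are assumptions for illustration, not the actual values.

    type storeCapacity struct {
        capacity  int64 // total bytes
        available int64 // free bytes
    }

    // balanceScore favors emptier stores: close to 1.0 for an empty store,
    // approaching 0.0 as it fills up.
    func balanceScore(c storeCapacity) float64 {
        if c.capacity == 0 {
            return 0
        }
        return float64(c.available) / float64(c.capacity)
    }

    // maxCapacityCheck rules out stores that are already roughly 95% full.
    func maxCapacityCheck(c storeCapacity) bool {
        return balanceScore(c) > 0.05
    }
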
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Dec 8, 2016
- Splits the scores returned by the rule solver into separate constraint and balance scores.
- Adds a valid field to constraints and adds it to all rules.
- Solve now returns all candidates instead of just the valid ones. To get only the valid candidates, the new function onlyValid and the new type candidateList have also been added.
- This allows us to use solve for removeTarget. It also cleans up the logic in removeTarget to more closely match the non-rule solver version.
- Splits the capacity rules into two rules. They were performing two different operations and didn't make sense being combined. This will also ease the change of converting the rules to basic functions.

Part of cockroachdb#10275
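
The candidateList and onlyValid additions described here might look roughly like the following; the field names are illustrative.

    type candidate struct {
        valid           bool
        constraintScore float64
        balanceScore    float64
    }

    type candidateList []candidate

    // onlyValid returns only the candidates that passed every rule.
    func (cl candidateList) onlyValid() candidateList {
        var valid candidateList
        for _, c := range cl {
            if c.valid {
                valid = append(valid, c)
            }
        }
        return valid
    }
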
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Dec 13, 2016
BramGruneir added a commit to BramGruneir/cockroach that referenced this issue Dec 16, 2016
This gets rid of the weird 1/(1+rangecount) that was there before. If
more complex scoring is required, it can be added then.

Part of cockroachdb#10275.
@petermattis petermattis added this to the 1.0 milestone Feb 23, 2017
@dianasaur323
Contributor

dianasaur323 commented Apr 12, 2017

@BramGruneir how much of this issue is still relevant, or is it outdated at this point?

@BramGruneir
Member Author

Most of this issue is complete; what remains is some cleanup and new tools that aren't relevant for 1.0.

@BramGruneir BramGruneir modified the milestones: Later, 1.0 Apr 17, 2017
@petermattis
Collaborator

@BramGruneir Is there anything left to do here? Perhaps worthwhile to file new issues about remaining work and close this down.

@BramGruneir
Member Author

There are 5 issues left. I'd like to get to them at some point. I figured after my 1.1 tasks are completed.

But none of them are needed for 1.1. What would be the point of adding in a new issue instead of keeping this open?

@petermattis
Collaborator

But none of them are needed for 1.1. What would be the point of adding in a new issue instead of keeping this open?

Clarity. It can be overwhelming to read the above every time this issue is visited. Also, some of the tasks don't have associated issues. Fine to leave this open too.

@petermattis petermattis added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 21, 2018
@petermattis
Collaborator

@BramGruneir Is there anything left to be done here? If yes, please file new issues and close this one.

@BramGruneir
Member Author

This is all finished.
