storage: rule solver post merge issues #10275
This adds back the 3 commits that were removed to facilitate the merge of develop back to master. One other commit is no longer required. Follow-up fixes are tracked in cockroachdb#10275. Closes cockroachdb#9336

1) 4446345 storage: add constraint rule solver for allocation

Rules are represented as a single function that returns the candidacy of the store as well as a float value representing the score. These scores are then aggregated across all rules, and the stores are returned sorted by them.

Current rules:
- ruleReplicasUniqueNodes ensures that no two replicas are put on the same node.
- ruleConstraints enforces that required and prohibited constraints are followed, and that stores with more positive constraints are ranked higher.
- ruleDiversity ensures that nodes that have the fewest locality tiers in common are given higher priority.
- ruleCapacity prioritizes placing data on empty nodes when the choice is available and prevents data from going onto mostly full nodes.

2) dd3229a storage: implemented RuleSolver into allocator

3) 27353a8 storage: removed unused rangeCountBalancer

There was a 4th commit that is no longer required; the simulation was already converging after adding a rebalance threshold:

4e29a36 storage/simulation: only rebalance 50% of ranges on each iteration so it will converge
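The rule shape described in commit 1 can be sketched as follows. All names and types here are illustrative stand-ins for the example, not the actual CockroachDB API:

```go
package main

import (
	"fmt"
	"sort"
)

// store is a stand-in for the store descriptor a rule inspects.
type store struct {
	name       string
	rangeCount int
}

// rule reports whether the store is a valid candidate and, if so,
// contributes a score to its ranking.
type rule func(s store) (candidate bool, score float64)

// solve runs every rule against every store, drops stores that any
// rule rejects, and returns the rest sorted by total score (descending).
func solve(stores []store, rules []rule) []store {
	type scored struct {
		s     store
		total float64
	}
	var out []scored
	for _, s := range stores {
		total, ok := 0.0, true
		for _, r := range rules {
			cand, sc := r(s)
			if !cand {
				ok = false
				break
			}
			total += sc
		}
		if ok {
			out = append(out, scored{s, total})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].total > out[j].total })
	res := make([]store, len(out))
	for i, o := range out {
		res[i] = o.s
	}
	return res
}

func main() {
	// A capacity-style rule: reject very full stores, prefer emptier ones.
	capacity := func(s store) (bool, float64) {
		if s.rangeCount > 100 {
			return false, 0
		}
		return true, 1.0 / float64(1+s.rangeCount)
	}
	// "c" is rejected outright; "b" ranks above "a" because it is emptier.
	fmt.Println(solve([]store{{"a", 50}, {"b", 5}, {"c", 200}}, []rule{capacity}))
}
```

The key property is that any single rule can veto a store entirely, while the scores of all passing rules are simply summed before sorting.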
What does it mean for a replica to be on an "incorrect replica"?
I'd add that the rule structure can be inflexible. For example, deciding to expand the list of candidates to include stores which don't match a positive constraint is awkward to fit into such an approach.
Doesn't the rule solver already take into account stores that don't match positive constraints? Stores with more positive constraints will have a higher score than stores with fewer matching constraints.
That's a good point. I guess we'd need to understand how hard we try to …
I think this can be solved easily using two scores instead of one. This would be a lot cleaner than the current solution, which uses a weight of 0.01 for the rebalancing heuristic and 1.00 for constraints. And of course, any rule can still invalidate a store entirely.
What does it mean to "find all stores that have the highest constraint score"? I would have imagined the constraint rule to filter the candidates, not apply a score. If a store does not match a required or positive constraint, we definitely want to exclude it from the list of candidates, right?
We have different types of constraints; see: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/expressive_zone_config.md When a required or negative constraint is violated, the rule returns false and the store is ruled out. For positive constraints the rule returns true regardless of whether the store matches, and instead increases the score if it does match. So if we split our rules into heuristics and constraint rules, we can quickly find all the stores that have the highest constraint score and then randomly choose from those based on their heuristic scores. (N.b. heuristic rules can still rule out a store if it is overfilled.)
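A minimal Go sketch of the constraint behavior just described, with hypothetical names and types for illustration (the real rules live in CockroachDB's storage package and differ in detail):

```go
package main

import "fmt"

type constraintType int

const (
	required constraintType = iota
	prohibited
	positive
)

type constraint struct {
	typ   constraintType
	value string // e.g. a locality tier or attribute like "ssd"
}

func hasAttr(attrs []string, v string) bool {
	for _, a := range attrs {
		if a == v {
			return true
		}
	}
	return false
}

// constraintScore mirrors the behavior described above: required and
// prohibited constraints can rule a store out entirely, while positive
// constraints never invalidate; they only raise the score when matched.
func constraintScore(attrs []string, cs []constraint) (valid bool, score float64) {
	for _, c := range cs {
		matched := hasAttr(attrs, c.value)
		switch c.typ {
		case required:
			if !matched {
				return false, 0
			}
		case prohibited:
			if matched {
				return false, 0
			}
		case positive:
			if matched {
				score++
			}
		}
	}
	return true, score
}

func main() {
	cs := []constraint{{required, "ssd"}, {positive, "us-east"}}
	fmt.Println(constraintScore([]string{"ssd", "us-east"}, cs)) // valid, score 1
	fmt.Println(constraintScore([]string{"hdd"}, cs))            // invalid: missing required "ssd"
}
```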
"highest constraint score"? The part that isn't clear to me is where you set the threshold for what is a "highest constraint score" and what is not. Put another way, I'm not clear on what this would look like in code or pseudo-code. Can you elaborate?
Instead of applying cockroachdb@1ef40f3 or cockroachdb#10252, this finishes the reapplication of the rule solver. However, this also puts the rule solver behind the environment flag COCKROACH_ENABLE_RULE_SOLVER for ease of testing; it defaults to disabled. The follow-up to this commit is cockroachdb#10275, plus a lot of testing to ensure that the rule solver does indeed perform as expected. Closes cockroachdb#9336
What happens if you define locality tiers inconsistently? The RFC says:
Is that true? Also, what do we do to inform the user that a constraint does not match any existing locality/attribute? Is it possible to validate constraints when a zone config is set? Users could add constraints and then update localities/attributes, so I imagine this would just be a warning, but in cases where there's a real mistake, this could really help.
That statement is true in the current code. The original idea was to have it log to console as well as display errors as a banner in the UI. There's a PR sitting around somewhere to do that.
Instead of applying 1ef40f3 or cockroachdb#10252, this finishes the reapplication of the rule solver. However, this also puts the rule solver behind the environment flag COCKROACH_ENABLE_RULE_SOLVER for ease of testing; it defaults to disabled. This commit re-applies the rule solver, specifically the following commits:

1) 4446345 storage: add constraint rule solver for allocation

Rules are represented as a single function that returns the candidacy of the store as well as a float value representing the score. These scores are then aggregated across all rules, and the stores are returned sorted by them.

Current rules:
- ruleReplicasUniqueNodes ensures that no two replicas are put on the same node.
- ruleConstraints enforces that required and prohibited constraints are followed, and that stores with more positive constraints are ranked higher.
- ruleDiversity ensures that nodes that have the fewest locality tiers in common are given higher priority.
- ruleCapacity prioritizes placing data on empty nodes when the choice is available and prevents data from going onto mostly full nodes.

2) dd3229a storage: implemented RuleSolver into allocator

The follow-up to this commit is cockroachdb#10275, plus a lot of testing to ensure that the rule solver does indeed perform as expected. Closes cockroachdb#9336
Cleaned this up. It was worded incorrectly.
Sure. Here's a more complete example. Rules should return three values: valid, a constraint score, and a rebalance score, and each rule would be able to return any combination of all three. valid is for hard constraints, such as when a required or prohibited constraint is violated or a store is overfilled. The algorithm to determine where a new replica should go is straightforward: once the rules are run against the available stores, and all non-valid ones are ruled out, find the collection of stores that have the highest constraint score. From that collection, pick two stores randomly and choose the one with the highest rebalance score. This introduces randomness back into the system and adheres to locality/attribute constraints as much as possible. This would remove the need for weighting the scores and make it clear how each rule affects rebalancing decisions.
Also, add a shuffle to the store pool's getStoreList. Relying on the range over a map to ensure randomness is a little too subtle. This is much more explicit. Part of cockroachdb#10275
I've been thinking a bit more about this. And for now, I think the quickest thing to do is just remove the scoring for free space entirely. That way we can randomize based on the top scoring stores (if there is more than one), which would closely match the current system.
I still don't understand how this would be done. What threshold would you use to separate "highest constraint score" from lower constraint scores?
Wow, that sentence got mangled. But as my follow-up comment mentioned, I don't think we need it right now. Constraint-score rules are all additive.
So from the group of available stores, find the one(s) that share the highest score. If there's more than one, pick a random one.
The first one ensures we never try to overfill a store, while the second generates a balance score based on how full the target store is. Part of cockroachdb#10275
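A sketch of the two capacity rules just described, assuming an illustrative 95% fullness threshold (the threshold and names are made up for the example; the real rules differ in detail):

```go
package main

import "fmt"

// maxCapacityFraction is an illustrative threshold, not the value the
// real allocator uses.
const maxCapacityFraction = 0.95

// ruleCapacityMax rules out stores that are too full to accept more data.
func ruleCapacityMax(used, total float64) bool {
	return used/total < maxCapacityFraction
}

// ruleCapacityBalance scores emptier stores higher.
func ruleCapacityBalance(used, total float64) float64 {
	return 1 - used/total
}

func main() {
	fmt.Println(ruleCapacityMax(96, 100))     // false: over the threshold, ruled out
	fmt.Println(ruleCapacityBalance(25, 100)) // 0.75: mostly empty, scores high
}
```

Splitting the rules this way keeps the hard invalidation (never overfill) separate from the soft preference (favor emptier stores), which is why combining them in one rule didn't make sense.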
- Split the scores returned by the rule solver into constraint and balance scores.
- Add a valid field to constraints and add it to all rules.
- Solve now returns all candidates instead of just the valid ones. To get only the valid candidates, the new function onlyValid and new type candidateList have also been added.
- This allows us to use solve for removeTarget. It also cleans up the logic in removeTarget to more closely match the non-rule-solver version.
- Split the capacity rules into two rules. They were performing two different operations and didn't make sense combined. This will also ease the change of converting the rules to basic functions.

Part of cockroachdb#10275
This gets rid of the weird 1/(1+rangeCount) that was there before. If more complex scoring is required, it can be added then. Part of cockroachdb#10275.
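For illustration only, here is a sketch contrasting the removed 1/(1+rangeCount) heuristic with a plain linear alternative. The linear form is an assumption made for this example, not the scoring the commit actually adopts:

```go
package main

import "fmt"

// oldBalanceScore is the 1/(1+rangeCount) heuristic being removed: the
// reciprocal curve compresses differences between already-full stores.
func oldBalanceScore(rangeCount int) float64 {
	return 1 / float64(1+rangeCount)
}

// linearBalanceScore (hypothetical) keeps the same ordering with a plain
// linear score: fewer ranges still scores higher, without the curve.
func linearBalanceScore(rangeCount, maxRangeCount int) float64 {
	return float64(maxRangeCount-rangeCount) / float64(maxRangeCount)
}

func main() {
	fmt.Println(oldBalanceScore(9))         // 0.1
	fmt.Println(linearBalanceScore(10, 40)) // 0.75
}
```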
@BramGruneir how much of this issue is still relevant, or is it outdated at this point?
Most of this issue is complete; what remains is some cleanup and new tooling that isn't relevant for 1.0.
@BramGruneir Is there anything left to do here? Perhaps worthwhile to file new issues about remaining work and close this down.
There are 5 issues left. I'd like to get to them at some point; I figured I'd pick them up after my 1.1 tasks are completed. But none of them are needed for 1.1. What would be the point of adding a new issue instead of keeping this open?
Clarity. It can be overwhelming to read the above every time this issue is visited. Also, some of the tasks don't have associated issues. Fine to leave this open too.
@BramGruneir Is there anything left to be done here? If yes, please file new issues and close this one.
This is all finished. |
During the re-introduction of the rule solver, a number of issues popped up. This issue will track those.
See #10252, #10507
Remaining tasks:
- [WIP] storage: add some randomness to the rule solver #11202: add some randomness to the rebalance and allocate targets.
- Add a corrupt replica test to TestRuleSolver. No longer needed, as this is no longer checked using the store pool.
- storage: split ruleCapacity into two rules #11721: consider splitting ruleCapacity into two rules.
- storage: remove pointers from allocator returns #11206: Allocator's RebalanceTarget and AllocateTarget should not return a pointer. This change was not useful, so dumping it.
- … scores to the candidate's score field names.