storage: replace the rule solver #12165

BramGruneir · 2016-12-08T03:12:22Z

The 1st commit is a collection of fixes and updates to the rule solver.

The 2nd commit adds back in the randomness that was missing from the rule solver.

The 3rd commit removes the rule solver altogether and replaces it with a collection of functions for the rules and 3 functions for allocation, removal and rebalancing.

There's more work to do here after merging, but this will bring us close enough to start testing rebalancing using this more expressive system.

With this PR #11702 and #11721 can be closed.

This change is

petermattis · 2016-12-08T17:10:28Z

Review status: 0 of 4 files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.

pkg/storage/allocator.go, line 362 at r8 (raw file):

	if a.options.UseRuleSolver {
		// TODO(bram): ShouldRebalance should be part of rebalanceCandidates
		// and decision made afterward, not it's own function.

You probably have a reason in mind. Add it to the comment.

pkg/storage/allocator.go, line 392 at r8 (raw file):

		// Find all candidates that are better than the worst existing replica.
		targets := candidates.betterThan(existingCandidates[len(existingCandidates)-1])

This will crash if len(existingCandidates) == 0 for some reason.

pkg/storage/rule_solver.go, line 33 at r8 (raw file):

	valid      bool
	constraint float64 // Score used to pick the top candidates.
	balance    float64 // Score used to choose between top candidates.

s/balance/capacity/g to reflect that this is always initialized to capacityScore.

pkg/storage/rule_solver.go, line 37 at r8 (raw file):

func (c candidate) String() string {
	return fmt.Sprintf("StoreID:%d, valid:%t, con:%.2f, bal:%.2f",

s/StoreID:%d/s%d to match the format used elsewhere.

pkg/storage/rule_solver.go, line 96 at r8 (raw file):

func (c byScoreAndID) Swap(i, j int) { c[i], c[j] = c[j], c[i] }

// onlyValid returns all the elements in a sorted candidate list that are valid.

You should add a comment to all of these candidateList methods that they require the candidates to be sorted (and which sorting is required).

pkg/storage/rule_solver.go, line 115 at r8 (raw file):

	for i := 1; i < len(cl); i++ {
		if cl[i].constraint < cl[0].constraint {
			return cl[0:i]

s/0:i/i/g

pkg/storage/rule_solver.go, line 127 at r8 (raw file):

		return cl
	}
	// Are there invalid values? If so, pick those.

s/values/candidates/g

pkg/storage/rule_solver.go, line 156 at r8 (raw file):

// selectGood randomly chooses a good candidate from a sorted candidate list
// using the provided random generator.
func (cl candidateList) selectGood(randGen allocatorRand) candidate {

Return a *candidate so that you don't require the caller to perform an allocation.

pkg/storage/rule_solver.go, line 164 at r8 (raw file):

	order := randGen.Perm(len(cl))
	randGen.Unlock()
	best := cl[order[0]]

This will crash if len(cl) == 0.

pkg/storage/rule_solver.go, line 183 at r8 (raw file):

	order := randGen.Perm(len(cl))
	randGen.Unlock()
	worst := cl[order[0]]

Ditto crash mentioned above.

pkg/storage/rule_solver.go, line 234 at r8 (raw file):

// removeCandidates creates a candidate list of all existing replicas' stores
// ordered from least qualified for removal to most qualified. Stores that are
// marked as not valid, are in violation of a criteria.

s/criteria/required criteria/g?

pkg/storage/rule_solver.go, line 254 at r8 (raw file):

		constraintScore := diversityRemovalScore(s.Node.NodeID, existingNodeLocalities) + float64(preferredMatched)
		if rebalanceToConvergesOnMean(sl, s) {
			constraintScore++

What does it mean for rebalanceToConvergesOnMean to affect constraintScore?

pkg/storage/rule_solver.go, line 439 at r8 (raw file):

// the highest scores.
func capacityScore(store roachpb.StoreDescriptor) float64 {
	return 1.0 / float64(store.Capacity.RangeCount+1)

Does this need to return a value between 0 and 1 now? You're randomly selecting from the candidates with the same constraint score and then only comparing this value against itself.

Comments from Reviewable

BramGruneir · 2016-12-09T04:33:23Z

Review status: 0 of 4 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

pkg/storage/allocator.go, line 362 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You probably have a reason in mind. Add it to the comment.

Done.

pkg/storage/allocator.go, line 392 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This will crash if len(existingCandidates) == 0 for some reason.

fixed

pkg/storage/rule_solver.go, line 33 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/balance/capacity/g to reflect that this is always initialized to capacityScore.

Done.

pkg/storage/rule_solver.go, line 37 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/StoreID:%d/s%d to match the format used elsewhere.

Done.

pkg/storage/rule_solver.go, line 96 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You should add a comment to all of these candidateList methods that they require the candidates to be sorted (and which sorting is required).

They already said "sorted", but I've added a "(by score reversed)" to them.

pkg/storage/rule_solver.go, line 115 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/0:i/i/g

you meant :i right? If so, done.

pkg/storage/rule_solver.go, line 127 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/values/candidates/g

Done.

pkg/storage/rule_solver.go, line 156 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Return a *candidate so that you don't require the caller to perform an allocation.

Done. Actually, changed this and selectBad to return a pointer to a storeDescriptor.

pkg/storage/rule_solver.go, line 164 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This will crash if len(cl) == 0.

fixed

pkg/storage/rule_solver.go, line 183 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto crash mentioned above.

fixed

pkg/storage/rule_solver.go, line 234 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/criteria/required criteria/g?

Done.

pkg/storage/rule_solver.go, line 254 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What does it mean for rebalanceToConvergesOnMean to affect constraintScore?

If we consider that we want to converge on toward mean when making allocation decisions, than it is a constraint (just not a formal constraint from our zone configs).

This allows us to use the criteria from shouldRebalance as part of ordering the candidates to make this decision in one place with full knowlege.

pkg/storage/rule_solver.go, line 439 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Does this need to return a value between 0 and 1 now? You're randomly selecting from the candidates with the same constraint score and then only comparing this value against itself.

It doesn't have to be between 0 and 1, but there's no reason not no. We want it to be higher for more empty replicas. It would be strange to order by constraint score descending and capacity score ascending.

Comments from Reviewable

petermattis · 2016-12-09T15:25:00Z

Review status: 0 of 4 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

pkg/storage/allocator.go, line 392 at r8 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

fixed

I might be missing the fix, but it still looks like this will crash if len(existingCandidates) == 0.

pkg/storage/rule_solver.go, line 254 at r8 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

If we consider that we want to converge on toward mean when making allocation decisions, than it is a constraint (just not a formal constraint from our zone configs).

This allows us to use the criteria from shouldRebalance as part of ordering the candidates to make this decision in one place with full knowlege.

Should this be rebalanceFromConvergesOnMean?

pkg/storage/rule_solver.go, line 439 at r8 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

It doesn't have to be between 0 and 1, but there's no reason not no. We want it to be higher for more empty replicas. It would be strange to order by constraint score descending and capacity score ascending.

It's also a little strange to be forcing the score into the range 0 to 1. And in order to do that you need to use RangeCount+1. I don't particularly care. Pick your strangeness.

pkg/storage/rule_solver.go, line 169 at r11 (raw file):

	order := randGen.Perm(len(cl))
	randGen.Unlock()
	best := cl[order[0]]

s/cl[order[0]]/&cl[order[0]]/g so that return &best.store does not cause an unnecessary allocation.

pkg/storage/rule_solver.go, line 191 at r11 (raw file):

	order := randGen.Perm(len(cl))
	randGen.Unlock()
	worst := cl[order[0]]

See above comment about using a pointer for worst.

Comments from Reviewable

BramGruneir · 2016-12-12T05:01:28Z

Review status: 0 of 4 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.

pkg/storage/allocator.go, line 392 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I might be missing the fix, but it still looks like this will crash if len(existingCandidates) == 0.

Ah, I fixed a different one.
I've put in a fix for this, but it shouldn't happen. This requires calling rebalance on a replica in which no store is alive in the store pool, including this one.
Perhaps a panic here is a good idea.

pkg/storage/rule_solver.go, line 254 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should this be rebalanceFromConvergesOnMean?

We actually want the opposite. SelectBad in the balancer calls that and uses it to identify preferred candidates for removal. In this case, we want the worse candidates to have a lower score.

I considered removing 1 form the constraint score, but I was worried about the interaction it might have with positive zone config constraints or diversity scores, both of which are always positive.

pkg/storage/rule_solver.go, line 439 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It's also a little strange to be forcing the score into the range 0 to 1. And in order to do that you need to use RangeCount+1. I don't particularly care. Pick your strangeness.

I'd be down for changing this up (and the already strange sort reverseness), but in a follow up PR. If we do come up with a collection of metrics to determine sizes, I'd rather make them range from -1 to 1 which more closely matches ML style scores.

pkg/storage/rule_solver.go, line 169 at r11 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/cl[order[0]]/&cl[order[0]]/g so that return &best.store does not cause an unnecessary allocation.

Done.

pkg/storage/rule_solver.go, line 191 at r11 (raw file):

Previously, petermattis (Peter Mattis) wrote…

See above comment about using a pointer for worst.

Done.

Comments from Reviewable

petermattis · 2016-12-12T14:07:39Z

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.

pkg/storage/rule_solver.go, line 439 at r8 (raw file):
Ok on adjusting this in a follow-on PR.

which more closely matches ML style scores

Why is that important?

pkg/storage/rule_solver.go, line 260 at r14 (raw file):

		}
		constraintScore := diversityRemovalScore(s.Node.NodeID, existingNodeLocalities) + float64(preferredMatched)
		if rebalanceToConvergesOnMean(sl, s) {

Ah, so should this be rebalanceFromConvergesOnMean. It doesn't make sense to be using the same function in both removeCandidates and rebalanceCandidates.

Comments from Reviewable

- Splits the scores returned in the rule solver into a constraint and balance scores. - Add a valid field to constraints and add it to all rules. - Solve now returns all candidates instead of just the valid ones. To get only the valid candidates, the new function onlyValid and new type condidateList have also been added. - This allows us to use solve for removeTarget. It also cleans up the logic in removeTarget to more closely match the non-rule solver version. - Split the capcity rules into two rules. They were performing two different operations and didnt' make sense being combined. This will also ease the change of converting the rules to basic functions. Part of cockroachdb#10275

This commit adds the equivalent of the current selectGood and selectBad so that the rule solver will also use the "power of two random choices" method that is currently used by the balancer.

BramGruneir · 2016-12-12T20:21:37Z

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.

pkg/storage/rule_solver.go, line 439 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ok on adjusting this in a follow-on PR.

which more closely matches ML style scores

Why is that important?

It's not, but since it's more or less a standard, we might be able to leverage some basic ML techniques if we have a number of criteria in need of weighting.
Note that this is not going to be needed anytime soon and hand tuned heuristics should be fine for a long time.

I'll send a follow up PR that cleans this up.

pkg/storage/rule_solver.go, line 260 at r14 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ah, so should this be rebalanceFromConvergesOnMean. It doesn't make sense to be using the same function in both removeCandidates and rebalanceCandidates.

Ok, so I had to step away from this and re-examine exactly how this should work. I think I've got it. Please let me know if my logic is flawed here.

First of all, rebalanceFromConvergesOnMean != !rebalanceToConvergesOnMean which was an assumption I had made.

With that in mind:

Removals: What we actually want is !rebalanceFromConvergesOnMean as it matches the selectBad by promoting those that do converge on the mean.

Rebalance: For the existing candidates, we want !rebalanceFromConvergesOnMean exactly like the removals. While for new candidates, we want only stores that rebalanceToConvergesOnMean Which matches improve'. This keeps the operations in line with the original rebalancing. So for all candidates, we eliminate any that don't rebalanceToConvergeOnMean` and add 1.0 to the base constraint score to match the positive removal score.

Comments from Reviewable

petermattis · 2016-12-12T21:16:13Z

As you probably realize, this PR is a beast which makes it very likely something has slipped through. I don't see a good way to mitigate for this PR, but in the future smaller PRs are better: easier on the reviewer and better reviewed code for the author.

Review status: 0 of 4 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.

pkg/storage/rule_solver.go, line 439 at r8 (raw file):

we might be able to leverage some basic ML techniques if we have a number of criteria in need of weighting

That's way too speculative to use as a basis for a change. When we do have a need to weight a number of criteria we should revisit, but not before.

pkg/storage/rule_solver.go, line 260 at r17 (raw file):

		}
		constraintScore := diversityRemovalScore(s.Node.NodeID, existingNodeLocalities) + float64(preferredMatched)
		if !rebalanceFromConvergesOnMean(sl, s) {

This is somewhat confusing and deserves a comment: Removing a replica from the candidate converges on the mean which makes the candidate more attractive, blah blah blah regarding selecting worst candidate.

pkg/storage/rule_solver.go, line 312 at r17 (raw file):

			}
			constraintScore := diversityRemovalScore(s.Node.NodeID, existingNodeLocalities) + float64(preferredMatched)
			if !rebalanceFromConvergesOnMean(sl, s) {

Ditto my other comment. This one can be a reference to the comment in removeCandidate.

pkg/storage/rule_solver.go, line 325 at r17 (raw file):

				continue
			}
			constraintScore := 1.0 + diversityScore(s, existingNodeLocalities) + float64(preferredMatched)

Adding a constant 1.0 won't affect the ordering of candidates. Either explain what it is doing there or remove.

Comments from Reviewable

Remove the concept of rules and replace it with clean direct functions.

BramGruneir · 2016-12-12T22:43:45Z

Review status: 0 of 4 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.

pkg/storage/rule_solver.go, line 260 at r17 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is somewhat confusing and deserves a comment: Removing a replica from the candidate converges on the mean which makes the candidate more attractive, blah blah blah regarding selecting worst candidate.

Done.

pkg/storage/rule_solver.go, line 312 at r17 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto my other comment. This one can be a reference to the comment in removeCandidate.

Done.

pkg/storage/rule_solver.go, line 325 at r17 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Adding a constant 1.0 won't affect the ordering of candidates. Either explain what it is doing there or remove.

Done.

Comments from Reviewable

BramGruneir requested a review from petermattis December 8, 2016 03:13

This was referenced Dec 8, 2016

storage: update rule solver candidates to have 2 scores and a valid bool #11702

Closed

storage: split ruleCapacity into two rules #11721

Closed

BramGruneir force-pushed the removerules branch from 7ca73d8 to 324f16f Compare December 8, 2016 04:36

BramGruneir force-pushed the removerules branch from 324f16f to a0c773d Compare December 9, 2016 04:29

BramGruneir force-pushed the removerules branch from a0c773d to b51aaef Compare December 12, 2016 04:54

BramGruneir added 2 commits December 12, 2016 11:30

storage: add a little randomness into the rule_solver

38d0626

This commit adds the equivalent of the current selectGood and selectBad so that the rule solver will also use the "power of two random choices" method that is currently used by the balancer.

BramGruneir force-pushed the removerules branch from b51aaef to b19ec35 Compare December 12, 2016 20:21

storage: replace the rule solver

bbe17fa

Remove the concept of rules and replace it with clean direct functions.

BramGruneir force-pushed the removerules branch from b19ec35 to bbe17fa Compare December 12, 2016 22:18

BramGruneir mentioned this pull request Dec 13, 2016

storage: rule solver post merge issues #10275

Closed

19 tasks

BramGruneir merged commit eeb5276 into cockroachdb:master Dec 13, 2016

BramGruneir deleted the removerules branch December 13, 2016 15:22

spencerkimball mentioned this pull request Mar 29, 2017

config: enhance ZoneConfig expressiveness #4868

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: replace the rule solver #12165

storage: replace the rule solver #12165

BramGruneir commented Dec 8, 2016 •

edited

Loading

petermattis commented Dec 8, 2016

BramGruneir commented Dec 9, 2016

petermattis commented Dec 9, 2016

BramGruneir commented Dec 12, 2016

petermattis commented Dec 12, 2016

BramGruneir commented Dec 12, 2016

petermattis commented Dec 12, 2016

BramGruneir commented Dec 12, 2016

storage: replace the rule solver #12165

storage: replace the rule solver #12165

Conversation

BramGruneir commented Dec 8, 2016 • edited Loading

petermattis commented Dec 8, 2016

BramGruneir commented Dec 9, 2016

petermattis commented Dec 9, 2016

BramGruneir commented Dec 12, 2016

petermattis commented Dec 12, 2016

BramGruneir commented Dec 12, 2016

petermattis commented Dec 12, 2016

BramGruneir commented Dec 12, 2016

BramGruneir commented Dec 8, 2016 •

edited

Loading