Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: replace the rule solver #12165

Merged
merged 3 commits into from
Dec 13, 2016

Conversation

BramGruneir
Copy link
Member

@BramGruneir BramGruneir commented Dec 8, 2016

The 1st commit is a collection of fixes and updates to the rule solver.

The 2nd commit adds back in the randomness that was missing from the rule solver.

The 3rd commit removes the rule solver altogether and replaces it with a collection of functions for the rules and 3 functions for allocation, removal and rebalancing.

There's more work to do here after merging, but this will bring us close enough to start testing rebalancing using this more expressive system.

With this PR #11702 and #11721 can be closed.


This change is Reviewable

@petermattis
Copy link
Collaborator

Review status: 0 of 4 files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.


pkg/storage/allocator.go, line 362 at r8 (raw file):

	if a.options.UseRuleSolver {
		// TODO(bram): ShouldRebalance should be part of rebalanceCandidates
		// and decision made afterward, not it's own function.

You probably have a reason in mind. Add it to the comment.


pkg/storage/allocator.go, line 392 at r8 (raw file):

		// Find all candidates that are better than the worst existing replica.
		targets := candidates.betterThan(existingCandidates[len(existingCandidates)-1])

This will crash if len(existingCandidates) == 0 for some reason.


pkg/storage/rule_solver.go, line 33 at r8 (raw file):

	valid      bool
	constraint float64 // Score used to pick the top candidates.
	balance    float64 // Score used to choose between top candidates.

s/balance/capacity/g to reflect that this is always initialized to capacityScore.


pkg/storage/rule_solver.go, line 37 at r8 (raw file):

func (c candidate) String() string {
	return fmt.Sprintf("StoreID:%d, valid:%t, con:%.2f, bal:%.2f",

s/StoreID:%d/s%d to match the format used elsewhere.


pkg/storage/rule_solver.go, line 96 at r8 (raw file):

func (c byScoreAndID) Swap(i, j int) { c[i], c[j] = c[j], c[i] }

// onlyValid returns all the elements in a sorted candidate list that are valid.

You should add a comment to all of these candidateList methods that they require the candidates to be sorted (and which sorting is required).


pkg/storage/rule_solver.go, line 115 at r8 (raw file):

	for i := 1; i < len(cl); i++ {
		if cl[i].constraint < cl[0].constraint {
			return cl[0:i]

s/0:i/i/g


pkg/storage/rule_solver.go, line 127 at r8 (raw file):

		return cl
	}
	// Are there invalid values? If so, pick those.

s/values/candidates/g


pkg/storage/rule_solver.go, line 156 at r8 (raw file):

// selectGood randomly chooses a good candidate from a sorted candidate list
// using the provided random generator.
func (cl candidateList) selectGood(randGen allocatorRand) candidate {

Return a *candidate so that you don't require the caller to perform an allocation.


pkg/storage/rule_solver.go, line 164 at r8 (raw file):

	order := randGen.Perm(len(cl))
	randGen.Unlock()
	best := cl[order[0]]

This will crash if len(cl) == 0.


pkg/storage/rule_solver.go, line 183 at r8 (raw file):

	order := randGen.Perm(len(cl))
	randGen.Unlock()
	worst := cl[order[0]]

Ditto crash mentioned above.


pkg/storage/rule_solver.go, line 234 at r8 (raw file):

// removeCandidates creates a candidate list of all existing replicas' stores
// ordered from least qualified for removal to most qualified. Stores that are
// marked as not valid, are in violation of a criteria.

s/criteria/required criteria/g?


pkg/storage/rule_solver.go, line 254 at r8 (raw file):

		constraintScore := diversityRemovalScore(s.Node.NodeID, existingNodeLocalities) + float64(preferredMatched)
		if rebalanceToConvergesOnMean(sl, s) {
			constraintScore++

What does it mean for rebalanceToConvergesOnMean to affect constraintScore?


pkg/storage/rule_solver.go, line 439 at r8 (raw file):

// the highest scores.
func capacityScore(store roachpb.StoreDescriptor) float64 {
	return 1.0 / float64(store.Capacity.RangeCount+1)

Does this need to return a value between 0 and 1 now? You're randomly selecting from the candidates with the same constraint score and then only comparing this value against itself.


Comments from Reviewable

@BramGruneir
Copy link
Member Author

Review status: 0 of 4 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


pkg/storage/allocator.go, line 362 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You probably have a reason in mind. Add it to the comment.

Done.


pkg/storage/allocator.go, line 392 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This will crash if len(existingCandidates) == 0 for some reason.

fixed


pkg/storage/rule_solver.go, line 33 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/balance/capacity/g to reflect that this is always initialized to capacityScore.

Done.


pkg/storage/rule_solver.go, line 37 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/StoreID:%d/s%d to match the format used elsewhere.

Done.


pkg/storage/rule_solver.go, line 96 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You should add a comment to all of these candidateList methods that they require the candidates to be sorted (and which sorting is required).

They already said "sorted", but I've added a "(by score reversed)" to them.


pkg/storage/rule_solver.go, line 115 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/0:i/i/g

you meant :i right? If so, done.


pkg/storage/rule_solver.go, line 127 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/values/candidates/g

Done.


pkg/storage/rule_solver.go, line 156 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Return a *candidate so that you don't require the caller to perform an allocation.

Done. Actually, changed this and selectBad to return a pointer to a storeDescriptor.


pkg/storage/rule_solver.go, line 164 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This will crash if len(cl) == 0.

fixed


pkg/storage/rule_solver.go, line 183 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto crash mentioned above.

fixed


pkg/storage/rule_solver.go, line 234 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/criteria/required criteria/g?

Done.


pkg/storage/rule_solver.go, line 254 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What does it mean for rebalanceToConvergesOnMean to affect constraintScore?

If we consider that we want to converge on toward mean when making allocation decisions, than it is a constraint (just not a formal constraint from our zone configs).

This allows us to use the criteria from shouldRebalance as part of ordering the candidates to make this decision in one place with full knowlege.


pkg/storage/rule_solver.go, line 439 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Does this need to return a value between 0 and 1 now? You're randomly selecting from the candidates with the same constraint score and then only comparing this value against itself.

It doesn't have to be between 0 and 1, but there's no reason not no. We want it to be higher for more empty replicas. It would be strange to order by constraint score descending and capacity score ascending.


Comments from Reviewable

@petermattis
Copy link
Collaborator

Review status: 0 of 4 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/storage/allocator.go, line 392 at r8 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

fixed

I might be missing the fix, but it still looks like this will crash if len(existingCandidates) == 0.


pkg/storage/rule_solver.go, line 254 at r8 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

If we consider that we want to converge on toward mean when making allocation decisions, than it is a constraint (just not a formal constraint from our zone configs).

This allows us to use the criteria from shouldRebalance as part of ordering the candidates to make this decision in one place with full knowlege.

Should this be rebalanceFromConvergesOnMean?


pkg/storage/rule_solver.go, line 439 at r8 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

It doesn't have to be between 0 and 1, but there's no reason not no. We want it to be higher for more empty replicas. It would be strange to order by constraint score descending and capacity score ascending.

It's also a little strange to be forcing the score into the range 0 to 1. And in order to do that you need to use RangeCount+1. I don't particularly care. Pick your strangeness.


pkg/storage/rule_solver.go, line 169 at r11 (raw file):

	order := randGen.Perm(len(cl))
	randGen.Unlock()
	best := cl[order[0]]

s/cl[order[0]]/&cl[order[0]]/g so that return &best.store does not cause an unnecessary allocation.


pkg/storage/rule_solver.go, line 191 at r11 (raw file):

	order := randGen.Perm(len(cl))
	randGen.Unlock()
	worst := cl[order[0]]

See above comment about using a pointer for worst.


Comments from Reviewable

@BramGruneir
Copy link
Member Author

Review status: 0 of 4 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


pkg/storage/allocator.go, line 392 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I might be missing the fix, but it still looks like this will crash if len(existingCandidates) == 0.

Ah, I fixed a different one.
I've put in a fix for this, but it shouldn't happen. This requires calling rebalance on a replica in which no store is alive in the store pool, including this one.
Perhaps a panic here is a good idea.


pkg/storage/rule_solver.go, line 254 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should this be rebalanceFromConvergesOnMean?

We actually want the opposite. SelectBad in the balancer calls that and uses it to identify preferred candidates for removal. In this case, we want the worse candidates to have a lower score.

I considered removing 1 form the constraint score, but I was worried about the interaction it might have with positive zone config constraints or diversity scores, both of which are always positive.


pkg/storage/rule_solver.go, line 439 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It's also a little strange to be forcing the score into the range 0 to 1. And in order to do that you need to use RangeCount+1. I don't particularly care. Pick your strangeness.

I'd be down for changing this up (and the already strange sort reverseness), but in a follow up PR. If we do come up with a collection of metrics to determine sizes, I'd rather make them range from -1 to 1 which more closely matches ML style scores.


pkg/storage/rule_solver.go, line 169 at r11 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/cl[order[0]]/&cl[order[0]]/g so that return &best.store does not cause an unnecessary allocation.

Done.


pkg/storage/rule_solver.go, line 191 at r11 (raw file):

Previously, petermattis (Peter Mattis) wrote…

See above comment about using a pointer for worst.

Done.


Comments from Reviewable

@petermattis
Copy link
Collaborator

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/rule_solver.go, line 439 at r8 (raw file):
Ok on adjusting this in a follow-on PR.

which more closely matches ML style scores

Why is that important?


pkg/storage/rule_solver.go, line 260 at r14 (raw file):

		}
		constraintScore := diversityRemovalScore(s.Node.NodeID, existingNodeLocalities) + float64(preferredMatched)
		if rebalanceToConvergesOnMean(sl, s) {

Ah, so should this be rebalanceFromConvergesOnMean. It doesn't make sense to be using the same function in both removeCandidates and rebalanceCandidates.


Comments from Reviewable

- Splits the scores returned in the rule solver into a constraint and balance scores.
- Add a valid field to constraints and add it to all rules.
- Solve now returns all candidates instead of just the valid ones. To get only the valid candidates, the new function onlyValid and new type  condidateList have also been added.
- This allows us to use solve for removeTarget. It also cleans up the logic in removeTarget to more closely match the non-rule solver version.
- Split the capcity rules into two rules. They were performing two different operations and didnt' make sense being combined. This will also ease the change of converting the rules to basic functions.

Part of cockroachdb#10275
This commit adds the equivalent of the current selectGood and selectBad so that the rule solver will also use the "power of two random choices" method that is currently used by the balancer.
@BramGruneir
Copy link
Member Author

Review status: 0 of 4 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/rule_solver.go, line 439 at r8 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ok on adjusting this in a follow-on PR.

which more closely matches ML style scores

Why is that important?

It's not, but since it's more or less a standard, we might be able to leverage some basic ML techniques if we have a number of criteria in need of weighting.
Note that this is not going to be needed anytime soon and hand tuned heuristics should be fine for a long time.

I'll send a follow up PR that cleans this up.


pkg/storage/rule_solver.go, line 260 at r14 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ah, so should this be rebalanceFromConvergesOnMean. It doesn't make sense to be using the same function in both removeCandidates and rebalanceCandidates.

Ok, so I had to step away from this and re-examine exactly how this should work. I think I've got it. Please let me know if my logic is flawed here.

First of all, rebalanceFromConvergesOnMean != !rebalanceToConvergesOnMean which was an assumption I had made.

With that in mind:

Removals: What we actually want is !rebalanceFromConvergesOnMean as it matches the selectBad by promoting those that do converge on the mean.

Rebalance: For the existing candidates, we want !rebalanceFromConvergesOnMean exactly like the removals. While for new candidates, we want only stores that rebalanceToConvergesOnMean Which matches improve'. This keeps the operations in line with the original rebalancing. So for all candidates, we eliminate any that don't rebalanceToConvergeOnMean` and add 1.0 to the base constraint score to match the positive removal score.


Comments from Reviewable

@petermattis
Copy link
Collaborator

As you probably realize, this PR is a beast which makes it very likely something has slipped through. I don't see a good way to mitigate for this PR, but in the future smaller PRs are better: easier on the reviewer and better reviewed code for the author.

:lgtm:


Review status: 0 of 4 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/storage/rule_solver.go, line 439 at r8 (raw file):

we might be able to leverage some basic ML techniques if we have a number of criteria in need of weighting

That's way too speculative to use as a basis for a change. When we do have a need to weight a number of criteria we should revisit, but not before.


pkg/storage/rule_solver.go, line 260 at r17 (raw file):

		}
		constraintScore := diversityRemovalScore(s.Node.NodeID, existingNodeLocalities) + float64(preferredMatched)
		if !rebalanceFromConvergesOnMean(sl, s) {

This is somewhat confusing and deserves a comment: Removing a replica from the candidate converges on the mean which makes the candidate more attractive, blah blah blah regarding selecting worst candidate.


pkg/storage/rule_solver.go, line 312 at r17 (raw file):

			}
			constraintScore := diversityRemovalScore(s.Node.NodeID, existingNodeLocalities) + float64(preferredMatched)
			if !rebalanceFromConvergesOnMean(sl, s) {

Ditto my other comment. This one can be a reference to the comment in removeCandidate.


pkg/storage/rule_solver.go, line 325 at r17 (raw file):

				continue
			}
			constraintScore := 1.0 + diversityScore(s, existingNodeLocalities) + float64(preferredMatched)

Adding a constant 1.0 won't affect the ordering of candidates. Either explain what it is doing there or remove.


Comments from Reviewable

Remove the concept of rules and replace it with clean direct functions.
@BramGruneir
Copy link
Member Author

Review status: 0 of 4 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/storage/rule_solver.go, line 260 at r17 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is somewhat confusing and deserves a comment: Removing a replica from the candidate converges on the mean which makes the candidate more attractive, blah blah blah regarding selecting worst candidate.

Done.


pkg/storage/rule_solver.go, line 312 at r17 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ditto my other comment. This one can be a reference to the comment in removeCandidate.

Done.


pkg/storage/rule_solver.go, line 325 at r17 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Adding a constant 1.0 won't affect the ordering of candidates. Either explain what it is doing there or remove.

Done.


Comments from Reviewable

@BramGruneir BramGruneir merged commit eeb5276 into cockroachdb:master Dec 13, 2016
@BramGruneir BramGruneir deleted the removerules branch December 13, 2016 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants