storage: Re-apply the rule solver. #10507

BramGruneir · 2016-11-07T16:14:15Z

Instead of applying 1ef40f3 or #10252, this finishes the
reapplication of the rule solver. However, this also puts the rule solver under
the environment flag COCKROACH_ENABLE_RULE_SOLVER for ease of testing and defaults
to not enabled.

The follow up to this commit is #10275 and a lot of testing to ensure that the
rule solver does indeed perform as expected.

Closes #9336

This change is

tamird · 2016-11-07T16:19:59Z

The commit message includes the string "1ef40f323b94005dd84378669da8296c9fec8e7cor " which probably should have been "1ef40f323b94005dd84378669da8296c9fec8e7 or".

tamird · 2016-11-07T16:20:30Z

On second look, 1ef40f3 is not a SHA that exists in this repo, so I don't really know what that string was intended to be.

tamird · 2016-11-07T16:21:00Z

Ah "1ef40f323b94005dd84378669da8296c9fec8e7c or"

BramGruneir · 2016-11-07T16:22:17Z

ok, tried to clean that up

petermattis · 2016-11-07T16:23:40Z

though I didn't look at the rule_solver{,_test}.go code.

Review status: 0 of 7 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.

pkg/storage/allocator.go, line 130 at r1 (raw file):

  // UseRuleSolver enables this store to use the updated rules based
  // constraint solver instead of the original rebalancer.
  UseRuleSolver bool

Is there any need for this field? It looks like this is only set to the value of the enableRuleSolver global.

pkg/storage/replicate_queue.go, line 129 at r1 (raw file):

  )
  if err != nil {
      return false, 0

Should the error be logged?

pkg/storage/replicate_queue.go, line 229 at r1 (raw file):

      )
      if rebalanceStore == nil || err != nil {
          log.VEventf(ctx, 1, "no suitable rebalance target")

Include the error via a %v format.

pkg/storage/store.go, line 105 at r1 (raw file):

  "COCKROACH_ENABLE_PREVOTE", false)

var enableRuleSolver = envutil.EnvOrDefaultBool("COCKROACH_ENABLE_RULE_SOLVER", false)

I'd prefer to see this in allocator.go, next to its usage.

Comments from Reviewable

tamird · 2016-11-07T16:24:46Z

Reviewed 1 of 7 files at r1.
Review status: 1 of 7 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.

pkg/storage/allocator.go, line 130 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is there any need for this field? It looks like this is only set to the value of the enableRuleSolver global.

It's used in new tests; reviewable is hiding that diff by default.

pkg/storage/store.go, line 105 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'd prefer to see this in allocator.go, next to its usage.

Technically its usage is here, not there.

pkg/storage/simulation/range.go, line 151 at r1 (raw file):

      r.desc.RangeID,
  )
  if err != nil {

Seems like you need to plumb an error out of this function.

Comments from Reviewable

petermattis · 2016-11-07T16:26:45Z

Review status: 1 of 7 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.

pkg/storage/allocator.go, line 130 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

It's used in new tests; reviewable is hiding that diff by default.

What tests? Took another look and I'm still not seeing that usage.

pkg/storage/simulation/range.go, line 151 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Seems like you need to plumb an error out of this function.

This is the allocation simulator. Doesn't seem worth changing.

Comments from Reviewable

tamird · 2016-11-07T16:30:11Z

Reviewed 2 of 7 files at r1.
Review status: 3 of 7 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.

pkg/storage/allocator.go, line 130 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What tests? Took another look and I'm still not seeing that usage.

allocator_test.go

pkg/storage/allocator.go, line 145 at r1 (raw file):

func MakeAllocator(storePool *StorePool, options AllocatorOptions) Allocator {
  var randSource rand.Source
  // There are number of test cases that make a test store but don't add

can you add TODOs to those tests, or is it essential that they not construct a gossip or store pool?

pkg/storage/replicate_queue.go, line 129 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Should the error be logged?

+1, log.ErrEvent probably

pkg/storage/replicate_queue.go, line 229 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Include the error via a %v format.

Prefer something a bit more involved:

if err != nil {
  log.ErrEvent(ctx, err.Error()) // or log.ErrEventf(ctx, "something something %s", err)
  return nil
}
// unchanged logic below
if rebalanceStore == nil {
  ...
  return nil
}

pkg/storage/rule_solver.go, line 53 at r1 (raw file):

// defaultRules is the default rule set to use.
var defaultRules = []rule{

this variable is package-scoped, not file-scoped. let's give it a name that is more reflective of where it's meant to be used, like defaultSolverRules.

pkg/storage/rule_solver.go, line 73 at r1 (raw file):

// makeDefaultRuleSolver returns a ruleSolver with defaultRules.
func makeDefaultRuleSolver() ruleSolver {

remove this method? not sure i see the point.

Comments from Reviewable

petermattis · 2016-11-07T16:36:16Z

Review status: 3 of 7 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.

pkg/storage/allocator.go, line 130 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

allocator_test.go

Wtf? Why was reviewable hiding that from me?

pkg/storage/allocator_test.go, line 523 at r1 (raw file):

          },
              []int{}, 1, false,
          },

There is a lot of boilerplate to these test cases which makes it difficult to read. I suggest creating some helper functions (local to this test).

Comments from Reviewable

tamird · 2016-11-07T17:31:02Z

It would also be good to amend the commit message to be free-standing. That is, rather than only referring to other commits/PRs, include directly the change that is being introduced here.

Reviewed 4 of 7 files at r1.
Review status: all files reviewed at latest revision, 46 unresolved discussions, some commit checks failed.

pkg/storage/allocator.go, line 130 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Wtf? Why was reviewable hiding that from me?

"Diff has 1008 lines. That will slow things down." it's just trying to be helpful.

pkg/storage/allocator.go, line 296 at r1 (raw file):

      var worst roachpb.ReplicaDescriptor
      worstScore := math.MaxFloat64

nit: worstScore := math.Inf(0)

pkg/storage/allocator.go, line 305 at r1 (raw file):

              continue
          }
          // If it's not a valid candidate, score will be zero.

Well, that's odd. Doesn't computeCandidate return a boolean (which you're ignoring here) expressly for this distinction? if you intend to use the zero value as "not valid" that's ok too, but then computeCandidate should not return a boolean.

pkg/storage/allocator.go, line 320 at r1 (raw file):

      }

      if worstScore < math.MaxFloat64 {

if math.IsInf(worstScore, 0) {?

pkg/storage/allocator.go, line 324 at r1 (raw file):

      }

      return roachpb.ReplicaDescriptor{}, errors.Errorf("RemoveTarget() could not select an appropriate replica to be remove")

errors.New, but also this error should not include the method name IMO; the caller should/would attach that using errors.Wrapf. This is just my intuition of the convention, so I would accept a counter-argument.

pkg/storage/allocator.go, line 400 at r1 (raw file):

      }

      // Load the exiting storesIDs into a map.

not a helpful comment. perhaps explain why this done instead.

pkg/storage/allocator.go, line 406 at r1 (raw file):

      }

      // Split the store list into existing and candidate stores lists.

not a helpful comment. perhaps explain why this done instead.

pkg/storage/allocator.go, line 431 at r1 (raw file):

      // Find all candidates that are better than the worst existing store.
      var worstCandidateStore float64
      // If any store from existing is not included in existingCandidates, it was

this sentence mixes past and present tenses. s/was/is/

pkg/storage/allocator.go, line 432 at r1 (raw file):

      var worstCandidateStore float64
      // If any store from existing is not included in existingCandidates, it was
      // because it no longer meets the Constraints. So its score would be 0.

s/Constrains. S/constrains, s/

pkg/storage/allocator.go, line 440 at r1 (raw file):

      for _, cand := range candidates {
          if cand.score > worstCandidateStore {
              return &(candidates[0].store), nil

these parentheses are superfluous

pkg/storage/allocator.go, line 507 at r1 (raw file):

      // TODO(peter,bram,cuong): The FractionUsed check seems suspicious. When a
      // node becomes fuller than maxFractionUsedThreshold we will always select it
      // for rebalancing. This is currently utilized by tests.

which tests?

pkg/storage/allocator.go, line 510 at r1 (raw file):

      maxCapacityUsed := store.Capacity.FractionUsed() >= maxFractionUsedThreshold

      // Rebalance if we're above the rebalance target, which is

are you duplicating the logic of rebalanceFromConvergesOnMean and/or rebalanceToConvergesOnMean here?

pkg/storage/allocator_test.go, line 202 at r1 (raw file):

      detail.desc = &roachpb.StoreDescriptor{
          StoreID: storeID,
          Node:    roachpb.NodeDescriptor{NodeID: roachpb.NodeID(storeID)},

why is the node descriptor needed now? here and below.

pkg/storage/allocator_test.go, line 250 at r1 (raw file):

  }

  t.Run("without rule solver", func(t *testing.T) {

This boilerplate is everywhere. How about extracting a runToggleRuleSolver (or perhaps a better name) helper? Ideally, you could also eliminate the named test closures introduced everywhere as well.

pkg/storage/allocator_test.go, line 523 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

There is a lot of boilerplate to these test cases which makes it difficult to read. I suggest creating some helper functions (local to this test).

+1

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

}

func Example_rebalancingWithRuleSolver() {

Can this share code with Example_rebalancing? The two also seem to produce different outcomes, which is worrying.

pkg/storage/rule_solver.go, line 44 at r1 (raw file):

// rule is a generic rule that can be used to solve a constraint problem.
// Returning false will remove the store from the list of candidate stores. The

this is a comment for run, not rule

pkg/storage/rule_solver.go, line 78 at r1 (raw file):

// makeRuleSolver makes a new ruleSolver. The order of the rules is the order in
// which they are run. For optimization purposes, less computationally intense

s/intense/expensive/

pkg/storage/rule_solver.go, line 91 at r1 (raw file):

}

// solve given constraints and return the score.

this comment is completely incorrect.

pkg/storage/rule_solver.go, line 114 at r1 (raw file):

}

// computeCandidate runs all the rules for the store and returns the candidacy

s/store/state/. what is "candidacy information"?

pkg/storage/rule_solver.go, line 123 at r1 (raw file):

          return candidate{}, false
      }
      if !math.IsNaN(score) {

what is this?

pkg/storage/rule_solver.go, line 130 at r1 (raw file):

}

// ruleReplicasUniqueNodes ensures that no two replicas are put on the same

please rewrite this comment in terms of arguments and return values. "ensures" implies side effects, which this does not have.

pkg/storage/rule_solver.go, line 144 at r1 (raw file):

// matches the key value pair in the constraint.
func storeHasConstraint(store roachpb.StoreDescriptor, c config.Constraint) bool {
  var found bool

this variable is useless

pkg/storage/rule_solver.go, line 163 at r1 (raw file):

}

// ruleConstraints enforces that required and prohibited constraints are

again, this doesn't "enforce" anything, it has an argument and return values.

pkg/storage/rule_solver.go, line 169 at r1 (raw file):

  for _, c := range state.constraints.Constraints {
      hasConstraint := storeHasConstraint(state.store, c)
      switch {

switch c.Type?

pkg/storage/rule_solver.go, line 183 at r1 (raw file):

// ruleDiversity ensures that nodes that have the fewest locality tiers in
// common are given higher priority.

again, this doesn't "ensure" anything, it has an argument and return values.

pkg/storage/rule_solver.go, line 204 at r1 (raw file):

}

// ruleCapacity prioritizes placing data on empty nodes when the choice is

ditto

pkg/storage/rule_solver.go, line 215 at r1 (raw file):

}

// canonicalTierOrder returns the most common key at each tier level.

how is it organized? this returns a slice, so that's not at all clear from this comment.

pkg/storage/rule_solver.go, line 232 at r1 (raw file):

      for _, store := range sl.stores {
          key := ""
          if i < len(store.Node.Locality.Tiers) {

seems like there's nothing to do if this condition is false; the empty key is never used:

if i < len(store.Node.Locality.Tiers) {
  key := store.Node.Locality.Tiers[i].Key
  counts[key]++
  if counts[key] > counts[maxKey] {
    maxKey = key
  }
}

In other words, this currently counts the occurrences of the empty key unnecessarily.

pkg/storage/rule_solver.go, line 248 at r1 (raw file):

}

// storeTierMap indexes a store list so you can look up the locality tier

the tone of this comment is inconsistent with our style. In general, we avoid pronouns other than "we".

pkg/storage/rule_solver.go, line 262 at r1 (raw file):

}

// byScore implements sort.Interface for candidate slices.

"for candidate slices" is clear from the type. what's not clear (and surprising) is that this is a descending sort.

pkg/storage/rule_solver_test.go, line 40 at r1 (raw file):

func (c byScoreAndID) Swap(i, j int) { c[i], c[j] = c[j], c[i] }

// TestRuleSolver tests the mechanics of ruleSolver.

useless comment

pkg/storage/rule_solver_test.go, line 48 at r1 (raw file):

  )
  defer stopper.Stop()
  // 3 alive replicas, 1 dead

am i misunderstanding this comment? seems like there are 4 "alive" replicas and one dead. However, this comment is also not helpful.

pkg/storage/rule_solver_test.go, line 52 at r1 (raw file):

  storePool.mu.Lock()
  storePool.mu.storeDetails[1].desc.Attrs.Attrs = []string{"a"}

there are a lot of magic strings and integers here. either sprinkle comments, or extract some local variables, or both.

pkg/storage/rule_solver_test.go, line 103 at r1 (raw file):

      expected []roachpb.StoreID
  }{
      // No constraints or rules.

optional: you could make these comments into a "name" field on the testCase struct above, and then use that as the name of a sub-test for each case.

pkg/storage/rule_solver_test.go, line 238 at r1 (raw file):

      sort.Sort(byScoreAndID(candidates))
      if len(candidates) != len(tc.expected) {
          t.Errorf("%d: length of %+v should match %+v", i, candidates, tc.expected)

if you use sub-tests, remember to remove the integer from this message and the one below.

pkg/storage/rule_solver_test.go, line 253 at r1 (raw file):

  testCases := []struct {
      stores [][]roachpb.Tier

"stores"? that's not an OK name.

pkg/storage/rule_solver_test.go, line 326 at r1 (raw file):

      sl := StoreList{}
      for _, tiers := range tc.stores {
          sl.stores = append(sl.stores, roachpb.StoreDescriptor{

this is not a legal way to construct a store list in general, and we shouldn't use it.

pkg/storage/rule_solver_test.go, line 334 at r1 (raw file):

      if out := canonicalTierOrder(sl); !reflect.DeepEqual(out, tc.want) {
          t.Errorf("%d: canonicalTierOrder(%+v) = %+v; not %+v", i, tc.stores, out, tc.want)

same comment about descriptive names for sub-tests (currently the test cases are inscrutable) and removing the index if you do so.

Comments from Reviewable

BramGruneir · 2016-11-09T05:38:24Z

Addressed all the comments and added more info the commit message.

Review status: 1 of 8 files reviewed at latest revision, 46 unresolved discussions, some commit checks pending.

pkg/storage/allocator.go, line 145 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

can you add TODOs to those tests, or is it essential that they not construct a gossip or store pool?

They call store.New, but don't need to do anything with the allocator or any networking and adding a gossip and store pool will just bloat the tests needlessly. I'm going to leave them as is.

pkg/storage/allocator.go, line 296 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

nit: worstScore := math.Inf(0)

Done.

pkg/storage/allocator.go, line 305 at r1 (raw file):
The bool result is irrelevant here, what we're looking for is the worst of the candidate.

As mentioned in the last review of this code:

That's a bool whether or not the that store is a potential candidate. If it's not a candidate, the score will be zero and thus will be the worst. Just makes the logic simpler.

But since this is the second time you've wondered about this code, I'm just going to explicitly check the bool to make it clearer.

pkg/storage/allocator.go, line 320 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

if math.IsInf(worstScore, 0) {?

Done.

pkg/storage/allocator.go, line 324 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

errors.New, but also this error should not include the method name IMO; the caller should/would attach that using errors.Wrapf. This is just my intuition of the convention, so I would accept a counter-argument.

Done. Fixed both instances (with and without rule solver)

pkg/storage/allocator.go, line 400 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

not a helpful comment. perhaps explain why this done instead.

Done.

pkg/storage/allocator.go, line 406 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

not a helpful comment. perhaps explain why this done instead.

Done.

pkg/storage/allocator.go, line 431 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this sentence mixes past and present tenses. s/was/is/

Done.

pkg/storage/allocator.go, line 432 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/Constrains. S/constrains, s/

Done.

pkg/storage/allocator.go, line 440 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

these parentheses are superfluous

Done.

pkg/storage/allocator.go, line 507 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

which tests?

This is an old and odd comment. I'm not entirely sure what it's referencing. Obviously we have rebalancing tests.

pkg/storage/allocator.go, line 510 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

are you duplicating the logic of rebalanceFromConvergesOnMean and/or rebalanceToConvergesOnMean here?

Entirely. Cleaned up by moving the rebalancing logic into the allocator directly. There was no need for the rangeCountBalancer

pkg/storage/allocator_test.go, line 202 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why is the node descriptor needed now? here and below.

The ruleReplicasUniqueNodes needs it.

pkg/storage/allocator_test.go, line 250 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

This boilerplate is everywhere. How about extracting a runToggleRuleSolver (or perhaps a better name) helper? Ideally, you could also eliminate the named test closures introduced everywhere as well.

Done.

pkg/storage/allocator_test.go, line 523 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

+1

spent some time on this, reordered, named and cleaned up all the tests.

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Can this share code with Example_rebalancing? The two also seem to produce different outcomes, which is worrying.

Sure, I'll extract out the shared code. But of course they'll produce different results, they're using a different function when choosing when to rebalance. What matters here is that the new output looks good.

pkg/storage/replicate_queue.go, line 129 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

+1, log.ErrEvent probably

Done.

pkg/storage/replicate_queue.go, line 229 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Prefer something a bit more involved:

if err != nil {
  log.ErrEvent(ctx, err.Error()) // or log.ErrEventf(ctx, "something something %s", err)
  return nil
}
// unchanged logic below
if rebalanceStore == nil {
  ...
  return nil
}

Done.

pkg/storage/rule_solver.go, line 44 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is a comment for run, not rule

Done.

pkg/storage/rule_solver.go, line 53 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this variable is package-scoped, not file-scoped. let's give it a name that is more reflective of where it's meant to be used, like defaultSolverRules.

Done.

pkg/storage/rule_solver.go, line 73 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

remove this method? not sure i see the point.

Going to leave this for now, it's in the follow up issue list already.

pkg/storage/rule_solver.go, line 78 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/intense/expensive/

Done.

pkg/storage/rule_solver.go, line 91 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this comment is completely incorrect.

Done.

pkg/storage/rule_solver.go, line 114 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/store/state/. what is "candidacy information"?

Done.

pkg/storage/rule_solver.go, line 123 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

what is this?

Wow. Tracked this down. It's due to not handling divided by zero errors in the rules correctly. That was some lazy coding. Fixed.

pkg/storage/rule_solver.go, line 130 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

please rewrite this comment in terms of arguments and return values. "ensures" implies side effects, which this does not have.

Done.

pkg/storage/rule_solver.go, line 144 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this variable is useless

Done.

pkg/storage/rule_solver.go, line 163 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

again, this doesn't "enforce" anything, it has an argument and return values.

Done.

pkg/storage/rule_solver.go, line 169 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

switch c.Type?

not with the anded bools

pkg/storage/rule_solver.go, line 183 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

again, this doesn't "ensure" anything, it has an argument and return values.

Done.

pkg/storage/rule_solver.go, line 204 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

ditto

Done.

pkg/storage/rule_solver.go, line 215 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

how is it organized? this returns a slice, so that's not at all clear from this comment.

Done.

pkg/storage/rule_solver.go, line 232 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

seems like there's nothing to do if this condition is false; the empty key is never used:
if i < len(store.Node.Locality.Tiers) {
  key := store.Node.Locality.Tiers[i].Key
  counts[key]++
  if counts[key] > counts[maxKey] {
    maxKey = key
  }
}
In other words, this currently counts the occurrences of the empty key unnecessarily.

That's by design. If most of the nodes don't have a tier, we don't add it. But we can short circuit the loop if that's the case.

pkg/storage/rule_solver.go, line 248 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

the tone of this comment is inconsistent with our style. In general, we avoid pronouns other than "we".

Done.

pkg/storage/rule_solver.go, line 262 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"for candidate slices" is clear from the type. what's not clear (and surprising) is that this is a descending sort.

Done.

pkg/storage/rule_solver_test.go, line 40 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

useless comment

Done.

pkg/storage/rule_solver_test.go, line 48 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

am i misunderstanding this comment? seems like there are 4 "alive" replicas and one dead. However, this comment is also not helpful.

nope, you're right, the comment was wrong.

pkg/storage/rule_solver_test.go, line 52 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

there are a lot of magic strings and integers here. either sprinkle comments, or extract some local variables, or both.

tried to clean up a good chunk of this test. I hope it reads cleaner. I think I'll leave a todo to split this test up. Each rule should have it's own set of tests, and these tests are nowhere near exhaustive and don't ensure the scores are correct.

pkg/storage/rule_solver_test.go, line 103 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

optional: you could make these comments into a "name" field on the testCase struct above, and then use that as the name of a sub-test for each case.

Funny, I already did just that while cleaning up the test before even getting to this comment.

pkg/storage/rule_solver_test.go, line 238 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

if you use sub-tests, remember to remove the integer from this message and the one below.

Done.

pkg/storage/rule_solver_test.go, line 253 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"stores"? that's not an OK name.

Done.

pkg/storage/rule_solver_test.go, line 326 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is not a legal way to construct a store list in general, and we shouldn't use it.

Done.

pkg/storage/rule_solver_test.go, line 334 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

same comment about descriptive names for sub-tests (currently the test cases are inscrutable) and removing the index if you do so.

Done.

Comments from Reviewable

tamird · 2016-11-09T16:19:04Z

Reviewed 7 of 7 files at r2.
Review status: all files reviewed at latest revision, 33 unresolved discussions, all commit checks successful.

pkg/storage/allocator.go, line 507 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

This is an old and odd comment. I'm not entirely sure what it's referencing. Obviously we have rebalancing tests.

OK...but you're introducing it (or moving it) here; perhaps you could investigate and remove it if it's nonsense rather than kicking the can down to the next sap?

pkg/storage/allocator.go, line 315 at r2 (raw file):

              tierOrder:   canonicalTierOrder(sl),
              tiers:       storeTierMap(sl),
          }); valid {

shouldn't this be valid && candidate.score < worstScore? Currently if valid is false the candidate will be picked up.

Or perhaps it should be !valid || candidate.score < worstScore - either way, the score variable is just confusing things.

pkg/storage/allocator.go, line 544 at r2 (raw file):

  rebalanceConvergesOnMean := rebalanceFromConvergesOnMean(sl, store)

  result :=

why rename this? prefer you leave the old name so that this is pure movement.

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Sure, I'll extract out the shared code. But of course they'll produce different results, they're using a different function when choosing when to rebalance. What matters here is that the new output looks good.

thanks for sharing the code. However, I still think we need to track down why they produce different results. My understanding was that this PR was introducing the solver facility, not changing rebalancing heuristics. Granted, the new stuff is going in behind a flag, but we should understand exactly why the behaviour is different (and kill it, because the behaviour should be the same to the extent that we can test it).

pkg/storage/rule_solver.go, line 73 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Going to leave this for now, it's in the follow up issue list already.

It has one caller. Is that really worth addressing in a follow up?

pkg/storage/rule_solver.go, line 91 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Done.

This comment now seems over-specified, and documents implementation, not semantics. Consider that there may be cases where we want to optimize the implementation such that we no longer call computeCandidate on every store; why would you constrain yourself with such a doc comment?

pkg/storage/rule_solver.go, line 114 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Done.

Same comment as above. This is overspecified; we may not want to run all the rules in the future - this phrasing may encourage someone to introduce side effects, which would be bad.

pkg/storage/rule_solver.go, line 232 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

That's by design. If most of the nodes don't have a tier, we don't add it. But we can short circuit the loop if that's the case.

I do not understand what you're saying here. Why does this function count the stores that don't have a tier? it is never useful.

pkg/storage/rule_solver.go, line 262 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Done.

On second thought, please make this an ascending sort, change the comment to "byScore implements sort.Interface." and then use https://godoc.org/sort#Reverse in the callers that need it.

pkg/storage/rule_solver.go, line 45 at r2 (raw file):

type rule struct {
  weight float64
  // when run returns false will remove the store from the list of candidate

this is an odd way to phrase this comment.

Consider that this comment is akin to a doc comment on an interface method; it should explain the semantics from the caller's perspective as well as from the implementer's perspective.

pkg/storage/rule_solver.go, line 129 at r2 (raw file):

// ruleReplicasUniqueNodes returns true iff no existing replica is present on
// the node.

the candidate's node. all these rules execute in the context of a store, correct?

pkg/storage/rule_solver.go, line 161 at r2 (raw file):

// ruleConstraints returns true iff all required and prohibited constraints are
// followed. Stores with more positive constraints return higher scores.

s/followed/satisfied/

s/Stores with more positive constraints return higher scores./The score returned is the count of matching positive constraints./

pkg/storage/rule_solver.go, line 182 at r2 (raw file):

}

// ruleDiversity returns higher scores for stores with the fewest locality tiers

in common with what? this sentence needs help.

pkg/storage/rule_solver.go, line 208 at r2 (raw file):

// ruleCapacity returns true iff a new replica won't overfill the store. The
// returned score is based on how empty nodes are, with the most empty nodes

The score returned is inversely proportional to the number of ranges on the candidate store?

also might be a good idea to break this into two rules, one which deals with the capacity fraction and another which deals with the range count.

pkg/storage/rule_solver.go, line 220 at r2 (raw file):

// canonicalTierOrder returns the most common key at each tier level in
// decreasing order of the top tier level down.

"of the top tier level down" - what does this mean?

pkg/storage/rule_solver_test.go, line 48 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

nope, you're right, the comment was wrong.

OK, but this comment is still useless, isn't it?

pkg/storage/rule_solver_test.go, line 40 at r2 (raw file):

func (c byScoreAndID) Swap(i, j int) { c[i], c[j] = c[j], c[i] }

// TestRuleSolver tests that ruleCapacity, ruleDiversity, ruleConstraints,

somewhat meta: why overspecify comments everywhere? naming the rules here doesn't help anything - you're already use subtests which have the names of the things they're testing and are less likely to rot.

pkg/storage/rule_solver_test.go, line 41 at r2 (raw file):

// TestRuleSolver tests that ruleCapacity, ruleDiversity, ruleConstraints,
// ruleReplicasUniqueNodes and that the solver itself functions as expected.

"functions as expected" is not a useful comment. if there's nothing useful to say here, i would prefer we omit the comment.

pkg/storage/rule_solver_test.go, line 58 at r2 (raw file):

  // tierSetup returns a tier struct constructed using the passed in values.
  // If slot is an empty string, it is not included.

this treatment is not unique to "slot".

pkg/storage/rule_solver_test.go, line 60 at r2 (raw file):

  // If slot is an empty string, it is not included.
  tierSetup := func(datacenter, floor, rack, slot string) []roachpb.Tier {
      tiers := []roachpb.Tier{}

var tiers roachpb.Tier is more consistent with our style

pkg/storage/rule_solver_test.go, line 80 at r2 (raw file):

  capacitySetup := func(available int64) roachpb.StoreCapacity {
      return roachpb.StoreCapacity{
          Capacity:   100,

something is off here; capacity and available are byte counts, and rangecount is a range count - why are you subtracting bytes from ranges?

pkg/storage/rule_solver_test.go, line 119 at r2 (raw file):

      // Store 1: score 0; Store 3: score 1; everything else fails.
      {
          name: "arbitrary rule",

how about "whitelist rule"? and then remove the comment above.

pkg/storage/rule_solver_test.go, line 124 at r2 (raw file):

              run: func(state solveState) (float64, bool) {
                  switch state.store.StoreID {
                  case 1:

since it amounts to rewriting the test, i won't insist, but please consider avoiding magic numbers (which are still repeated all over this test). You're putting 1,3 in here, which only works because you created the store pool above with 1,2,3,5 - what I was expecting was an extraction of some local variables or constants that held these values and were referred to by name in these test cases.

pkg/storage/rule_solver_test.go, line 135 at r2 (raw file):

          expected: []roachpb.StoreID{3, 1},
      },
      // Test that ruleReplicasUniqueNodes correctly avoids overlapping

i think you can/should remove all these comments.

pkg/storage/rule_solver_test.go, line 247 at r2 (raw file):

      t.Run(tc.name, func(t *testing.T) {
          var rules []rule
          if tc.rule.run != nil {

am i not seeing it? seems like they are all non-nil and this can be removed.

pkg/storage/rule_solver_test.go, line 257 at r2 (raw file):

          }
          sort.Sort(byScoreAndID(candidates))
          if len(candidates) != len(tc.expected) {

why don't you use reflect.DeepEqual and then print pretty.Diff in the failure case?

pkg/storage/rule_solver_test.go, line 259 at r2 (raw file):

          if len(candidates) != len(tc.expected) {
              t.Errorf("length of %+v should match %+v", candidates, tc.expected)
              return

this is equivalent to t.Fatalf on the preceding line.

Comments from Reviewable

BramGruneir · 2016-11-09T20:06:34Z

Review status: 5 of 8 files reviewed at latest revision, 33 unresolved discussions.

pkg/storage/allocator.go, line 507 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

OK...but you're introducing it (or moving it) here; perhaps you could investigate and remove it if it's nonsense rather than kicking the can down to the next sap?

Done. Of course tests that rebalance use it. Removed.

pkg/storage/allocator.go, line 315 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

shouldn't this be valid && candidate.score < worstScore? Currently if valid is false the candidate will be picked up.

Or perhaps it should be !valid || candidate.score < worstScore - either way, the score variable is just confusing things.

!valid just means we use a score of 0. We want it to be picked up.

I've cleaned the code again and added a clearer comment.

pkg/storage/allocator.go, line 544 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why rename this? prefer you leave the old name so that this is pure movement.

I'm not a fan of having a variable and the function named the same thing. I can revert if you think it's best left alone.

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

thanks for sharing the code. However, I still think we need to track down why they produce different results. My understanding was that this PR was introducing the solver facility, not changing rebalancing heuristics. Granted, the new stuff is going in behind a flag, but we should understand exactly why the behaviour is different (and kill it, because the behaviour should be the same to the extent that we can test it).

I can pinpoint exactly why. It's because instead of being ordered randomly, the rebalance target is chosen based on ruleCapacity's score.

At this point, I think it's safe to leave it in as two different tests with different outputs.

pkg/storage/rule_solver.go, line 73 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

It has one caller. Is that really worth addressing in a follow up?

Done.

pkg/storage/rule_solver.go, line 91 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

This comment now seems over-specified, and documents implementation, not semantics. Consider that there may be cases where we want to optimize the implementation such that we no longer call computeCandidate on every store; why would you constrain yourself with such a doc comment?

Done.

pkg/storage/rule_solver.go, line 114 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Same comment as above. This is overspecified; we may not want to run all the rules in the future - this phrasing may encourage someone to introduce side effects, which would be bad.

Done.

pkg/storage/rule_solver.go, line 232 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

I do not understand what you're saying here. Why does this function count the stores that don't have a tier? it is never useful.

If the majority of nodes do not have a tier, we don't consider any tiers lower than it. This is used as part of ruleDiversity.

pkg/storage/rule_solver.go, line 262 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

On second thought, please make this an ascending sort, change the comment to "byScore implements sort.Interface." and then use https://godoc.org/sort#Reverse in the callers that need it.

It's only used once and that would be very inefficient. I've renamed it to be clearer.

pkg/storage/rule_solver.go, line 45 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is an odd way to phrase this comment.

Consider that this comment is akin to a doc comment on an interface method; it should explain the semantics from the caller's perspective as well as from the implementer's perspective.

somehow the comment got mangled. Cleaned up.

pkg/storage/rule_solver.go, line 129 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

the candidate's node. all these rules execute in the context of a store, correct?

Yeah. Fixed.

pkg/storage/rule_solver.go, line 161 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/followed/satisfied/

s/Stores with more positive constraints return higher scores./The score returned is the count of matching positive constraints./

yes to the first one, but the second correction is incorrect.

The total is divided by total number of constraints.

pkg/storage/rule_solver.go, line 182 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

in common with what? this sentence needs help.

Done.

pkg/storage/rule_solver.go, line 208 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

The score returned is inversely proportional to the number of ranges on the candidate store?

also might be a good idea to break this into two rules, one which deals with the capacity fraction and another which deals with the range count.

Fixed.

I'm going to leave it as one rule for now. But I'm going to be altering the rules in the future so I'll add a todo.

pkg/storage/rule_solver.go, line 220 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"of the top tier level down" - what does this mean?

Reworded a bit. Just not sure how to phrase it.

As an example:
region = USA/EU
datacenter = us-e / us-w / eu-n / eu-s
rack = 1/2/3/4/5

So from the order is from region down to rack

pkg/storage/rule_solver_test.go, line 48 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

OK, but this comment is still useless, isn't it?

Done.

pkg/storage/rule_solver_test.go, line 40 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

somewhat meta: why overspecify comments everywhere? naming the rules here doesn't help anything - you're already use subtests which have the names of the things they're testing and are less likely to rot.

In this case, it was because I'd like to split this test up into individual rule tests.

But for the most part, it was a backlash against the very poor commenting in the code originally.

pkg/storage/rule_solver_test.go, line 41 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"functions as expected" is not a useful comment. if there's nothing useful to say here, i would prefer we omit the comment.

Done.

pkg/storage/rule_solver_test.go, line 58 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this treatment is not unique to "slot".

Done.

pkg/storage/rule_solver_test.go, line 60 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

var tiers roachpb.Tier is more consistent with our style

Done.

pkg/storage/rule_solver_test.go, line 80 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

something is off here; capacity and available are byte counts, and rangecount is a range count - why are you subtracting bytes from ranges?

You're right. I'm going to change up the function. All the tests cases were designed liked this.

pkg/storage/rule_solver_test.go, line 119 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

how about "whitelist rule"? and then remove the comment above.

perfect

pkg/storage/rule_solver_test.go, line 124 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

since it amounts to rewriting the test, i won't insist, but please consider avoiding magic numbers (which are still repeated all over this test). You're putting 1,3 in here, which only works because you created the store pool above with 1,2,3,5 - what I was expecting was an extraction of some local variables or constants that held these values and were referred to by name in these test cases.

Yeah, it would be a lot cleaner with a full re-write. I tried to do what I could. Let me know what you think, I can always revert it. But I'm going to spend a good amount of time re-writing this whole test later.

pkg/storage/rule_solver_test.go, line 135 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

i think you can/should remove all these comments.

Done.

pkg/storage/rule_solver_test.go, line 247 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

am i not seeing it? seems like they are all non-nil and this can be removed.

Yes, in the first test case where there is no rule.

pkg/storage/rule_solver_test.go, line 257 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

why don't you use reflect.DeepEqual and then print pretty.Diff in the failure case?

Can't, it's only comparing ids, not scores. Another reason why I want to re-write and split up the test.

pkg/storage/rule_solver_test.go, line 259 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is equivalent to t.Fatalf on the preceding line.

Done.

Comments from Reviewable

tamird · 2016-11-09T20:36:28Z

Reviewed 3 of 3 files at r3.
Review status: all files reviewed at latest revision, 19 unresolved discussions, all commit checks successful.

pkg/storage/allocator.go, line 507 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Done. Of course tests that rebalance use it. Removed.

Ah, sorry, I didn't see that this is pure movement (because you didn't delete the original code until a later revision). Let's restore it so that indeed it is movement without modification.

By the way, I think you should consider leaving this code in the original file it was in (despite the fact that you're changing the signatures) to minimize the diff and acquisition of git blame, but up to you.

pkg/storage/allocator.go, line 315 at r2 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

!valid just means we use a score of 0. We want it to be picked up.

I've cleaned the code again and added a clearer comment.

Somehow I am still not getting it. This code looks for the "worst" candidate based on score, and it ignores the "valid" return because that's equivalent to a score of zero. However, a score of zero is _also_ the lowest possible score, so why doesn't this immediately return if `valid`?

pkg/storage/allocator.go, line 544 at r2 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

I'm not a fan of having a variable and the function named the same thing. I can revert if you think it's best left alone.

Yes, and also see comment above about avoiding this part of the diff completely.

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

I can pinpoint exactly why. It's because instead of being ordered randomly, the rebalance target is chosen based on ruleCapacity's score.

At this point, I think it's safe to leave it in as two different tests with different outputs.

I'm going to insist that you not do this. If you can pinpoint exactly why they are different, then please make them the same here. Once this change is rolled out and considered safe, we can use the new "better" semantics.

pkg/storage/rule_solver.go, line 262 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

It's only used once and that would be very inefficient. I've renamed it to be clearer.

Huh? Take another look at `sort.Reverse`, it is exactly identical in performance. The caller would write `sort.Sort(sort.Reverse(byScore(...)))`. In my opinion, we should be using that, rather than going against the grain like this.

pkg/storage/rule_solver.go, line 161 at r2 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

yes to the first one, but the second correction is incorrect.

The total is divided by total number of constraints.

"The score returned is proportional to the count of matching positive constraints."? the current phrasing implies that stores somehow have constraints, which they do not.

pkg/storage/rule_solver.go, line 220 at r2 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Reworded a bit. Just not sure how to phrase it.

As an example:
region = USA/EU
datacenter = us-e / us-w / eu-n / eu-s
rack = 1/2/3/4/5

So from the order is from region down to rack

The use of the word "important" makes no sense to me, and the bit about "cutting off" also doesn't.

As far as I can tell, this is trying to stitch together the localities of all the nodes, but it's doing so in a way that doesn't make any sense, and what's worse is that the returned tiers have no value, only a key, leaving me wondering why a Tier is returned at all.

pkg/storage/rule_solver.go, line 47 at r3 (raw file):

  // run is a function that given a solveState will score and possibly
  // disqualify a store. The store in solveState can be disqualified by
  // returning false. Unless disqualified, the higher the returned score, the

random tab?

pkg/storage/rule_solver_test.go, line 124 at r2 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Yeah, it would be a lot cleaner with a full re-write. I tried to do what I could. Let me know what you think, I can always revert it. But I'm going to spend a good amount of time re-writing this whole test later.

Looks good, thanks.

pkg/storage/rule_solver_test.go, line 247 at r2 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Yes, in the first test case where there is no rule.

Ah, I see.

pkg/storage/rule_solver_test.go, line 118 at r3 (raw file):

      {
          name:     "no constraints or rules",
          expected: []roachpb.StoreID{storeRSa, 2, 3, 5},

did you forget to change the other magic numbers here?

pkg/storage/rule_solver_test.go, line 135 at r3 (raw file):

              },
          },
          expected: []roachpb.StoreID{3, storeRSa},

"3"?

pkg/storage/rule_solver_test.go, line 144 at r3 (raw file):

              {NodeID: roachpb.NodeID(storeFRabc)},
          },
          expected: []roachpb.StoreID{storeRab, 5},

did you forget to change "5"?

Comments from Reviewable

Instead of applying 1ef40f3 or cockroachdb#10252, this finishes the reapplication of the rule solver. However, this also puts the rule solver under the environment flag COCKROACH_ENABLE_RULE_SOLVER for ease of testing and defaults to not enabled. This commit re-applies the rule solver, specifically the following commits: 1) 4446345 storage: add constraint rule solver for allocation Rules are represented as a single function that returns the candidacy of the store as well as a float value representing the score. These scores are then aggregated from all rules and returns the stores sorted by them. Current rules: - ruleReplicasUniqueNodes ensures that no two replicas are put on the same node. - ruleConstraints enforces that required and prohibited constraints are followed, and that stores with more positive constraints are ranked higher. - ruleDiversity ensures that nodes that have the fewest locality tiers in common are given higher priority. - ruleCapacity prioritizes placing data on empty nodes when the choice is available and prevents data from going onto mostly full nodes. 2) dd3229a storage: implemented RuleSolver into allocator The follow up to this commit is cockroachdb#10275 and a lot of testing to ensure that the rule solver does indeed perform as expected. Closes cockroachdb#9336

BramGruneir · 2016-11-09T22:01:19Z

Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.

pkg/storage/allocator.go, line 507 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Ah, sorry, I didn't see that this is pure movement (because you didn't delete the original code until a later revision). Let's restore it so that indeed it is movement without modification.

By the way, I think you should consider leaving this code in the original file it was in (despite the fact that you're changing the signatures) to minimize the diff and acquisition of git blame, but up to you.

Done. I'm fine with taking the blame.

pkg/storage/allocator.go, line 315 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Somehow I am still not getting it. This code looks for the "worst" candidate based on score, and it ignores the "valid" return because that's equivalent to a score of zero. However, a score of zero is also the lowest possible score, so why doesn't this immediately return if valid?

I'll put in a quick exit if there's ever a score of 0.

pkg/storage/allocator.go, line 544 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Yes, and also see comment above about avoiding this part of the diff completely.

Done.

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

I'm going to insist that you not do this. If you can pinpoint exactly why they are different, then please make them the same here. Once this change is rolled out and considered safe, we can use the new "better" semantics.

And I'm going to remain steadfast on not doing this. It would fundamentally change the rule solver and every test for it. The constraint rule would have to change and the sort by score would need to be removed as well. This work can be done in a follow up.

pkg/storage/rule_solver.go, line 262 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Huh? Take another look at sort.Reverse, it is exactly identical in performance. The caller would write sort.Sort(sort.Reverse(byScore(...))). In my opinion, we should be using that, rather than going against the grain like this.

Done.

pkg/storage/rule_solver.go, line 161 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"The score returned is proportional to the count of matching positive constraints."? the current phrasing implies that stores somehow have constraints, which they do not.

Fixed. Yes, they're attributes and localities.

pkg/storage/rule_solver.go, line 220 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

The use of the word "important" makes no sense to me, and the bit about "cutting off" also doesn't.

As far as I can tell, this is trying to stitch together the localities of all the nodes, but it's doing so in a way that doesn't make any sense, and what's worse is that the returned tiers have no value, only a key, leaving me wondering why a Tier is returned at all.

Diversity is looking for the most dissimilar localities. To so do, it first calls this function that looks for the most common localities keys (not the values) and then looks at the diversity within the common tiers' values. Is this the best way of performing this task? I'm pretty sure it's not. But I'm not going to muck around with the logic at this point.

But you are correct, there's no need to return tiers when a string slice would be perfectly acceptable. Changed.

As for the wording, does broadest to the most local tier work?

pkg/storage/rule_solver.go, line 47 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

random tab?

weird. Looks just a space in my editor. Should be removed.

pkg/storage/rule_solver_test.go, line 118 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

did you forget to change the other magic numbers here?

yep. Done.

pkg/storage/rule_solver_test.go, line 135 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"3"?

Done.

pkg/storage/rule_solver_test.go, line 144 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

did you forget to change "5"?

Done.

Comments from Reviewable

tamird · 2016-11-09T22:31:58Z

Reviewed 3 of 3 files at r4.
Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

And I'm going to remain steadfast on not doing this. It would fundamentally change the rule solver and every test for it. The constraint rule would have to change and the sort by score would need to be removed as well. This work can be done in a follow up.

Seems we are at an impasse!

pkg/storage/rule_solver.go, line 220 at r2 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Diversity is looking for the most dissimilar localities. To so do, it first calls this function that looks for the most common localities keys (not the values) and then looks at the diversity within the common tiers' values. Is this the best way of performing this task? I'm pretty sure it's not. But I'm not going to muck around with the logic at this point.

But you are correct, there's no need to return tiers when a string slice would be perfectly acceptable. Changed.

As for the wording, does broadest to the most local tier work?

I believe, though, that the logic here is broken. Consider the following localities:

node1: "region=usa, dc=1, rack=2"
node2: "region=usa, dc=1, floor=7, room=5"

How do you decide which of "rack" or "floor" is more specific? or is the point of the "most nodes don't have" that it's impossible to have such undecidability?

Comments from Reviewable

petermattis · 2016-11-09T22:38:05Z

Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Seems we are at an impasse!

The new allocator has different heuristics that are not identical to the old ones. I don't think the rule-based allocator in this PR is ready for production. We need to re-add the randomization, but I think that should be done in a follow-on PR.

Comments from Reviewable

tamird · 2016-11-09T22:55:56Z

Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.

pkg/storage/allocator_test.go, line 1750 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The new allocator has different heuristics that are not identical to the old ones. I don't think the rule-based allocator in this PR is ready for production. We need to re-add the randomization, but I think that should be done in a follow-on PR.

In my opinion, doing such iterative work in multiple PRs makes it harder to observe that we reach the right end state, not easier.

That said, I seem to have been overruled on this point. So be it.

Comments from Reviewable

BramGruneir · 2016-11-10T15:26:05Z

Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.

pkg/storage/rule_solver.go, line 220 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

I believe, though, that the logic here is broken. Consider the following localities:
node1: "region=usa, dc=1, rack=2"
node2: "region=usa, dc=1, floor=7, room=5"
How do you decide which of "rack" or "floor" is more specific? or is the point of the "most nodes don't have" that it's impossible to have such undecidability?

Oh, I agree entirely. And if you have

node1: "region=usa"
node2: "region=usa"
node3: "region=usa, dc=1"
node4: "region=usa, dc=1"

Then only region usa would show up. But if the order is reversed, than dc would be picked. So this is clearly not the right solution. But I'd rather tackle this work in a follow up PR that concentrates on the diversity rule specifically.

Comments from Reviewable

BramGruneir · 2016-11-10T18:28:41Z

@tamird, So I think I've covered everything here.. Am I good to merge?

Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.

Comments from Reviewable

tamird · 2016-11-10T18:43:50Z

You have Pete's approval.

On Nov 10, 2016 13:28, "Bram Gruneir" [email protected] wrote:

@tamird https://github.com/tamird, So I think I've covered everything

here.. Am I good to merge?

Review status: all files reviewed at latest revision, 8 unresolved

discussions, all commit checks successful.

Comments from Reviewable
https://reviewable.io:443/reviews/cockroachdb/cockroach/10507#-:-KWEUk0nilFNbSgWz376:b-cgj8iz

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#10507 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABdsPEG51lek6F5Spa5k2V2Ih3q3ZcQqks5q82JcgaJpZM4KrZa8
.

BramGruneir added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Nov 7, 2016

BramGruneir force-pushed the rules branch from 193f057 to 581c2bc Compare November 7, 2016 16:22

jseldess added the docs-todo label Nov 8, 2016

BramGruneir force-pushed the rules branch from 581c2bc to d3b6a4f Compare November 9, 2016 05:34

BramGruneir force-pushed the rules branch from d3b6a4f to e651674 Compare November 9, 2016 20:06

BramGruneir force-pushed the rules branch from e651674 to a1f49a0 Compare November 9, 2016 22:01

BramGruneir mentioned this pull request Nov 10, 2016

storage: rule solver post merge issues #10275

Closed

19 tasks

BramGruneir merged commit 25449d9 into cockroachdb:master Nov 10, 2016

BramGruneir deleted the rules branch November 10, 2016 19:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: Re-apply the rule solver. #10507

storage: Re-apply the rule solver. #10507

BramGruneir commented Nov 7, 2016 •

edited

Loading

tamird commented Nov 7, 2016

tamird commented Nov 7, 2016

tamird commented Nov 7, 2016

BramGruneir commented Nov 7, 2016

petermattis commented Nov 7, 2016

tamird commented Nov 7, 2016

petermattis commented Nov 7, 2016

tamird commented Nov 7, 2016

petermattis commented Nov 7, 2016

tamird commented Nov 7, 2016

BramGruneir commented Nov 9, 2016

tamird commented Nov 9, 2016

BramGruneir commented Nov 9, 2016

tamird commented Nov 9, 2016

BramGruneir commented Nov 9, 2016

tamird commented Nov 9, 2016

petermattis commented Nov 9, 2016

tamird commented Nov 9, 2016

BramGruneir commented Nov 10, 2016

BramGruneir commented Nov 10, 2016

tamird commented Nov 10, 2016

here.. Am I good to merge?

discussions, all commit checks successful.

storage: Re-apply the rule solver. #10507

storage: Re-apply the rule solver. #10507

Conversation

BramGruneir commented Nov 7, 2016 • edited Loading

tamird commented Nov 7, 2016

tamird commented Nov 7, 2016

tamird commented Nov 7, 2016

BramGruneir commented Nov 7, 2016

petermattis commented Nov 7, 2016

tamird commented Nov 7, 2016

petermattis commented Nov 7, 2016

tamird commented Nov 7, 2016

petermattis commented Nov 7, 2016

tamird commented Nov 7, 2016

BramGruneir commented Nov 9, 2016

tamird commented Nov 9, 2016

BramGruneir commented Nov 9, 2016

tamird commented Nov 9, 2016

BramGruneir commented Nov 9, 2016

tamird commented Nov 9, 2016

petermattis commented Nov 9, 2016

tamird commented Nov 9, 2016

BramGruneir commented Nov 10, 2016

BramGruneir commented Nov 10, 2016

tamird commented Nov 10, 2016

here.. Am I good to merge?

discussions, all commit checks successful.

BramGruneir commented Nov 7, 2016 •

edited

Loading