
Epoch-based range leases implementation #10305

Merged (1 commit, Dec 8, 2016)

Conversation

spencerkimball
Member

@spencerkimball commented Oct 28, 2016

Introduce new epoch-based range leases. These are designed to use
the same machinery as the expiration-based leases but use epochs from
the node liveness table instead of expirations.

The same Lease protobuf is used for both types of leases, but there's now an
optional Epoch. Previously, the lease proto had a "stasis" timestamp; that has
now been removed and replaced by logic in the replica that evaluates the state
of a lease.

In order to evaluate whether a lease is valid at command apply time
(downstream of Raft), we evaluate the lease upstream of Raft and send
it with every Raft command to be compared to the lease at apply time.

See: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/range_leases.md
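For illustration, here is a minimal Go sketch of the idea (simplified stand-ins, not the actual roachpb definitions): the presence of the Epoch field is what distinguishes the two lease variants carried by the one Lease message.

package main

import "fmt"

// LeaseType distinguishes the two lease variants carried by the single Lease proto.
type LeaseType int

const (
	LeaseExpiration LeaseType = iota // validity bounded by an expiration timestamp
	LeaseEpoch                       // validity tied to the holder's node-liveness epoch
)

// Lease is a simplified stand-in for the roachpb.Lease message described above.
type Lease struct {
	Expiration int64  // wall time; meaningful only for expiration-based leases
	Epoch      *int64 // set only for epoch-based leases (the new optional field)
}

// Type reports which variant this lease is, based on whether Epoch is set.
func (l Lease) Type() LeaseType {
	if l.Epoch != nil {
		return LeaseEpoch
	}
	return LeaseExpiration
}

func main() {
	epoch := int64(3)
	fmt.Println(Lease{Epoch: &epoch}.Type() == LeaseEpoch)     // true
	fmt.Println(Lease{Expiration: 12345}.Type() == LeaseEpoch) // false
}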



@spencerkimball
Member Author

+cc @andreimatei

@spencerkimball force-pushed the spencerkimball/epoch-range-leases branch 3 times, most recently from 0aded85 to c76b6b3 on October 29, 2016 14:20
@petermattis
Collaborator

@tschottdorf definitely needs to put his eyes on this.


Review status: 0 of 41 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


build/npm.installed, line 0 at r1 (raw file):
Pretty sure this shouldn't be checked in. @tamird?


pkg/roachpb/data.proto, line 313 at r1 (raw file):

  optional ReplicaDescriptor replica = 3 [(gogoproto.nullable) = false];

  reserved 4;

I think we discussed not using reserved and instead renaming fields. Though I can't recall the reason why.


pkg/roachpb/data_test.go, line 593 at r1 (raw file):

      expSuccess bool
  }{
      {epL1, epL1, true},    // same epoch lease

I think this might be clearer if you define helper methods to create epoch and expiration based leases. For example:

epoch := func(r ReplicaDescriptor, start hlc.Timestamp, epoch int) Lease {
}
expiration := func(r ReplicaDescriptor, start, expiration hlc.Timestamp) Lease {
}

I'm asking for this because it was visually difficult to distinguish between the prefixes ep and ex when reading these test cases.
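A sketch of what such helpers might look like, using simplified stand-in types (the real ones are roachpb.ReplicaDescriptor, hlc.Timestamp, and roachpb.Lease):

package main

import "fmt"

// Simplified stand-ins, just to show the shape of the suggested helpers.
type ReplicaDescriptor struct{ NodeID, StoreID, ReplicaID int }

type Timestamp struct{ WallTime int64 }

type Lease struct {
	Replica    ReplicaDescriptor
	Start      Timestamp
	Expiration Timestamp
	Epoch      *int64
}

// epoch builds an epoch-based lease and expiration builds an expiration-based
// one, so a test case can read {epoch(...), expiration(...), true} instead of
// forcing the reader to tell epL1 and exL1 apart.
func epoch(r ReplicaDescriptor, start Timestamp, e int64) Lease {
	return Lease{Replica: r, Start: start, Epoch: &e}
}

func expiration(r ReplicaDescriptor, start, exp Timestamp) Lease {
	return Lease{Replica: r, Start: start, Expiration: exp}
}

func main() {
	r := ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 1}
	fmt.Println(epoch(r, Timestamp{WallTime: 1}, 1).Epoch != nil)                           // true
	fmt.Println(expiration(r, Timestamp{WallTime: 1}, Timestamp{WallTime: 2}).Epoch == nil) // true
}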


pkg/storage/below_raft_protos_test.go, line 92 at r1 (raw file):

      populatedConstructor: func(r *rand.Rand) proto.Message { return roachpb.NewPopulatedLease(r, false) },
      emptySum:             15357864913567733633,
      populatedSum:         10752200836664780297,

Isn't it problematic that this is changing? Will this PR require a freeze-cluster upgrade?


pkg/storage/replica.go, line 735 at r1 (raw file):

// getLease returns the current lease, and the tentative next one, if a lease
// request initiated by this replica is in progress.
func (r *Replica) getLease() (roachpb.Lease, roachpb.Lease) {

Any reason you changed these to values? Using nil to indicate not present seemed nicer than returning the zero roachpb.Lease.


pkg/storage/replica_range_lease.go, line 267 at r1 (raw file):

//   minus the maximum clock offset).
// - The lease is considered in stasis if the timestamp is within the
//   maximum clock offset window of the lease expiration.

I thought the previous stasis period was present so that we could adjust the max clock offset without fouling up previous leases. Guess I'm not following why it is safe to remove that field now.


pkg/storage/replica_range_lease.go, line 283 at r1 (raw file):

//   processed the change in lease holdership).
// * the client fails to read their own write.
func (r *Replica) leaseStatus(lease *roachpb.Lease, timestamp hlc.Timestamp) (status LeaseStatus) {

Why the named return value? Those are generally frowned upon, especially when there is only a single return value like here.


pkg/storage/storagebase/proposer_kv.proto, line 34 at r1 (raw file):

      (gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.RangeID"];
  optional roachpb.ReplicaDescriptor origin_replica = 2 [(gogoproto.nullable) = false];
  // origin_lease is provided to verify at raft command apply-time that

I realize using origin here is meant to be consistent with origin_replica, but perhaps proposer would be more accurate (in both cases).



@petermattis added the S-1-stability label (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting) on Oct 29, 2016
@spencerkimball force-pushed the spencerkimball/epoch-range-leases branch 2 times, most recently from 5337712 to f5ddf81 on October 31, 2016 20:01
@spencerkimball
Member Author

Review status: 0 of 41 files reviewed at latest revision, 8 unresolved discussions.


build/npm.installed, line 0 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Pretty sure this shouldn't be checked in. @tamird?

Done.

pkg/roachpb/data.proto, line 313 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think we discussed not using reserved and instead renaming fields. Though I can't recall the reason why.

Weird, I can't recall the reason either.

pkg/roachpb/data_test.go, line 593 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think this might be clearer if you define helper methods to create epoch and expiration based leases. For example:

epoch := func(r ReplicaDescriptor, start hlc.Timestamp, epoch int) Lease {
}
expiration := func(r ReplicaDescriptor, start, expiration hlc.Timestamp) Lease {
}

I'm asking for this because it was visually difficult to distinguish between the prefixes ep and ex when reading these test cases.

I clarified naming but differently than requested.

pkg/storage/below_raft_protos_test.go, line 92 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Isn't it problematic that this is changing? Will this PR require a freeze-cluster upgrade?

Yes I think we'll need a freeze cluster upgrade as we could get checksum issues between replicas.

pkg/storage/replica.go, line 735 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Any reason you changed these to values? Using nil to indicate not present seemed nicer than returning the zero roachpb.Lease.

I had originally run into various complicated problems using a pointer to the lease so I just changed them all for clarity. We could have taken a less drastic step, but this seemed reasonable as leases are small.

pkg/storage/replica_range_lease.go, line 267 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I thought the previous stasis period was present so that we could adjust the max clock offset without fouling up previous leases. Guess I'm not following why it is safe to remove that field now.

We are not supporting changing the max clock offset without a stop-the-world cluster restart. We now enforce identical clock offsets between nodes: see #9612.

See discussion around max offset here: https://reviewable.io/reviews/cockroachdb/cockroach/9530#-KSY-22S-5_ZlsGZ280o-r4-33


pkg/storage/replica_range_lease.go, line 283 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why the named return value? Those are generally frowned upon, especially when there is only a single return value like here.

Done.

pkg/storage/storagebase/proposer_kv.proto, line 34 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I realize using origin here is meant to be consistent with origin_replica, but perhaps proposer would be more accurate (in both cases).

Done.


@spencerkimball force-pushed the spencerkimball/epoch-range-leases branch 3 times, most recently from 4814c4f to 240a377 on October 31, 2016 22:26
@andreimatei
Contributor

it's nice how not many things had to change :)
But I don't see new tests... am I missing them?


Review status: 0 of 41 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 785 at r2 (raw file):

// redirectOnOrAcquireLease checks whether this replica has the lease at the
// current timestamp. If it does, returns success. If another replica currently

"current timestamp" seems stale. Also the new timestamp param should have a comment.

But... is the fact that we're completely shifting to using a request timestamp a good thing? For example, don't we want to renew a lease if it's about to expire even if the current request has an older timestamp?
Conversely, did we have a bug before because we wouldn't serve reads with older timestamps because the lease was in stasis (according to the current hlc)?


pkg/storage/replica.go, line 881 at r2 (raw file):

      case pErr = <-llChan:
          if pErr == nil {
              log.Eventf(ctx, "lease acquisition succeeded: %+v", status.lease)

Why not VInfof(2, ...) ? Also for these calls above.


pkg/storage/replica.go, line 2732 at r2 (raw file):

  cmd, cmdProposedLocally := r.mu.proposals[idKey]

  isLeaseError := func() bool {

I think the commit message should call out that you've fixed the previously possible (I think...) situation: a replica proposes a command under lease A, then quickly transfers A away, then gets another lease B, and then applies the command (the application's check of the lease would succeed).


pkg/storage/replica.go, line 2817 at r2 (raw file):

      // holder.
      log.VEventf(
          ctx, 1, "command proposed from replica %+v (lease at %v): %s: %v",

nit: s/lease at/proposed under lease


pkg/storage/replica_range_lease.go, line 33 at r3 (raw file):

)

// pendingLeaseRequest coalesces RequestLease requests and lets callers

I think this comment needs to be expanded now that this struct handles two types of leases. That's magical.


pkg/storage/replica_range_lease.go, line 132 at r3 (raw file):

  if replica.store.Stopper().RunAsyncTask(context.TODO(), func(ctx context.Context) {
      ctx = replica.AnnotateCtx(ctx)

how about extracting this closure in a top-level function, to keep functions reasonably-sized?


pkg/storage/replica_range_lease.go, line 136 at r3 (raw file):

      // If epoch-based lease & state is expired, increment epoch or heartbeat.
      if reqLease.Epoch != 0 && status.liveness != nil && status.state == leaseExpired {

I think you've created an accessor for Epoch that tells you if it's expiration-based. Use it here?
Also why do you need to check both Epoch and liveness?


pkg/storage/replica_range_lease.go, line 151 at r3 (raw file):

          }
          // Set error for propagation to all waiters below.
          pErr = roachpb.NewError(newNotLeaseHolderError(&status.lease, replica.store.StoreID(), replica.Desc()))

as we discussed, you only wanna set this pErr if err was set above.
I'd call "let's add a test" :)


pkg/storage/replica_state.go, line 142 at r3 (raw file):

  lease *roachpb.Lease,
) error {
  if lease == nil {

maybe replace this with a lease.Empty() check?



andreimatei added a commit to andreimatei/cockroach that referenced this pull request Oct 31, 2016
This has come up in cockroachdb#10305: we have a test that sets the MaxOffset to a
very high value because it will push the clock and it doesn't want the
MaxOffset checks that assert that commands are not coming from the (far)
future to complain. That unfortunately doesn't quite work, as the
MaxOffset is used for the leases' stasis period (which apparently only
10305 starts checking) and so the leases would always be in stasis.

Also moved the MaxOffset to the TestingKnobs, so that people are not
encouraged to use it.

Also cleaned up some MaxOffsets that are no longer needed since we
stopped starting new leases at curTime + MaxOffset.
@bdarnell
Contributor

bdarnell commented Nov 1, 2016

Reviewed 26 of 41 files at r1, 2 of 8 files at r2, 1 of 1 files at r3, 12 of 12 files at r4.
Review status: all files reviewed at latest revision, 34 unresolved discussions, some commit checks failed.


pkg/roachpb/data.proto, line 313 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Weird, I can't recall the reason either.

The reason to rename instead of reserve is that if you find yourself decoding an old protobuf you can still see the old field's name. But when we discussed this the final decision was to just use `reserved`.

pkg/sql/txn_restart_test.go, line 1136 at r4 (raw file):

// the clock, so that the transaction timestamp exceeds the deadline of the EndTransactionRequest.
func TestReacquireLeaseOnRestart(t *testing.T) {
  t.Skip("setting max offset > the lease expiration prevents use of lease")

Unless this skip is removed before merge, it should link to an issue number tracking what is necessary to reinstate this test.


pkg/storage/below_raft_protos_test.go, line 92 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Yes I think we'll need a freeze cluster upgrade as we could get checksum issues between replicas.

That means we'll need to fix freeze-cluster before this can go in.

pkg/storage/client_raft_test.go, line 1047 at r4 (raw file):

  // get a range lease.
  // TODO(bdarnell): understand why some tests need this.
  mtc.expireLeases()

Be sure to run this test (and others where you've made this change) under stress to make sure that the expireLeases call is no longer necessary.


pkg/storage/client_test.go, line 1019 at r4 (raw file):

// expired. If incEpochs=true, each node epoch is incremented and live
// nodes are heartbeat so they can acquire leases.
func (m *multiTestContext) expireLeasesAndIncrementEpochs(incEpochs bool) {

Shouldn't this be the other way around? If expireLeases also increments epochs, then the only use of this function should be to pass false to it to avoid the increment. Then you could have expireLeases call expireLeasesWithoutIncrementEpoch, which would no longer need an argument.
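A sketch of the suggested shape (multiTestContext and the helper methods below are placeholders, not the real test harness):

package main

import "fmt"

// multiTestContext is a stand-in for the test harness type.
type multiTestContext struct{}

func (m *multiTestContext) advanceClocksPastLeases() { fmt.Println("leases expired") }

func (m *multiTestContext) incrementEpochsAndHeartbeatLiveNodes() {
	fmt.Println("epochs incremented, live nodes heartbeat")
}

// expireLeases covers the common case: expire all leases and bump epochs so
// that live nodes can reacquire leases.
func (m *multiTestContext) expireLeases() {
	m.expireLeasesWithoutIncrementEpoch()
	m.incrementEpochsAndHeartbeatLiveNodes()
}

// expireLeasesWithoutIncrementEpoch is the rarely needed variant and, per the
// suggestion, takes no argument.
func (m *multiTestContext) expireLeasesWithoutIncrementEpoch() {
	m.advanceClocksPastLeases()
}

func main() {
	var m multiTestContext
	m.expireLeases()
}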


pkg/storage/node_liveness.go, line 82 at r4 (raw file):

  livenessThreshold time.Duration
  heartbeatInterval time.Duration
  pauseHeartbeat    atomic.Value

Comment that this atomic.Value contains a bool.


pkg/storage/node_liveness.go, line 181 at r4 (raw file):

// Heartbeat triggers a heartbeat outside of the normal periodic
// heartbeat loop. Used for unittesting.
func (nl *NodeLiveness) Heartbeat(ctx context.Context, liveness *Liveness) error {

Either move this to helpers_test.go or just rename heartbeat to Heartbeat. There's no need to have two methods with the same signature whose names only differ in case in the same scope.


pkg/storage/node_liveness.go, line 203 at r4 (raw file):

      // Set actual liveness on unexpected mismatch.
      newLiveness = actual
      // If the actual liveness is further ahead than expected, we're done.

How could that happen? It seems like we'd at least need to compare epochs and not just expirations.


pkg/storage/node_liveness.go, line 204 at r4 (raw file):

      newLiveness = actual
      // If the actual liveness is further ahead than expected, we're done.
      if !actual.Expiration.Less(newLiveness.Expiration) {

Thanks to the assignment on the previous line, this condition is always true.
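A reduced illustration of the bug and one possible fix (simplified types; the real fields live on the storage Liveness proto):

package main

import "fmt"

type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

type Liveness struct{ Expiration Timestamp }

func main() {
	expected := Liveness{Expiration: Timestamp{WallTime: 100}}
	actual := Liveness{Expiration: Timestamp{WallTime: 50}} // actually behind expected

	// Buggy shape: once newLiveness is overwritten with actual, the condition
	// compares actual against itself and can never be false.
	newLiveness := actual
	fmt.Println(!actual.Expiration.Less(newLiveness.Expiration)) // always true

	// One possible fix: compare against the liveness we expected before the
	// assignment clobbered it.
	if !actual.Expiration.Less(expected.Expiration) {
		fmt.Println("actual is ahead of (or equal to) expected: done")
	} else {
		fmt.Println("actual is behind expected: keep going")
	}
}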


pkg/storage/replica.go, line 785 at r2 (raw file):

For example, don't we want to renew a lease if it's about to expire even if the current request has an older timestamp?

For most ranges, leases are no longer renewed through this path. For those that are, it would be slightly better to renew them regardless of the request timestamp, but I don't think it matters much and therefore is not worth the complexity - the difference between the request timestamp and HLC should be small and if one request doesn't renew the lease, the next will.

Conversely, did we have a bug before because we wouldn't serve reads with older timestamps because the lease was in stasis

Probably, but we'd still serve those reads because the lease would get renewed (and the request redirected if necessary) transparently. It would just be a performance issue.


pkg/storage/replica.go, line 756 at r4 (raw file):

// IsLeaseValid returns true if the replica's lease is owned by this
// replica and is valid (not expired, not in stasis).

Remove the mention of stasis from this comment.


pkg/storage/replica.go, line 901 at r4 (raw file):

              }
              pErr = roachpb.NewError(err)
              log.Infof(ctx, "")

Remove this.


pkg/storage/replica.go, line 909 at r4 (raw file):

          pErr = roachpb.NewError(newNotLeaseHolderError(nil, r.store.StoreID(), r.Desc()))
      }
      return LeaseStatus{}, pErr

Add a panic if pErr is nil to make sure we don't reach this point without setting it.


pkg/storage/replica.go, line 1190 at r4 (raw file):

          // if we know the current lease holder for that range, which
          // indicates that our knowledge is not stale.
          if repl.haveLease(r.store.Clock().Now()) {

The old code was checking for a valid lease held by any node; haveLease looks for this node to be the lease holder (which is incorrect here)


pkg/storage/replica.go, line 1695 at r4 (raw file):

      }
  } else {
      lease, _ = r.getLease()

This passes a potentially-bogus lease down through raft. I think we should use a zero-valued (or nil, undoing the pointer-to-value change) lease here and repeat the IsSingleSkipLeaseCheckRequest downstream of raft to decide whether we need to validate the proposal's lease or not.


pkg/storage/replica.go, line 2818 at r4 (raw file):

      // proposal due to Raft delays / reorderings.
      //
      // We make an exception if this command is a freeze change (which

RequestLease and TransferLease also skip the lease check upstream.


pkg/storage/replica_range_lease.go, line 150 at r4 (raw file):

              }
          }
          // Set error for propagation to all waiters below.

We return an error even if Heartbeat or IncrementEpoch succeeded? The flow here is unclear.


pkg/storage/replica_range_lease.go, line 230 at r4 (raw file):

//
// The replica lock must be held.
func (r *Replica) expiringLease() bool {

Rename this to expiringLeaseLocked.


pkg/storage/replica_range_lease.go, line 238 at r4 (raw file):

const (
  leaseValid   leaseState = iota // can be used

Put leaseError first so that an uninitialized leaseState doesn't default to leaseValid.
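The concern is Go's zero value: with iota the first enumerator is 0, which is exactly what an uninitialized leaseState field holds. A small illustration of the suggested ordering (the enumerator list is trimmed to the names visible in this review):

package main

import "fmt"

type leaseState int

// With leaseError first, the zero value of leaseState is an error state rather
// than leaseValid, so a LeaseStatus that was never filled in fails safe.
const (
	leaseError leaseState = iota // unusable / unknown
	leaseValid                   // can be used
	leaseExpired                 // past expiration (or past the liveness epoch)
)

type LeaseStatus struct {
	state leaseState
}

func main() {
	var s LeaseStatus                  // never explicitly initialized
	fmt.Println(s.state == leaseError) // true: defaults to the error state
	fmt.Println(s.state == leaseValid) // false
}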


pkg/storage/replica_range_lease.go, line 244 at r4 (raw file):

)

// LeaseStatus holds the least state, the timestamp at which the state

s/least/lease/



@tbg
Member

tbg commented Nov 1, 2016

Only gave this a high-level overview so far; it looks roughly like I expected (but then again, the devil is in the details, and I haven't looked at those). Definitely needs a lot of tests written. What's the plan going forward w.r.t. #10327? I strongly suspect that #10327 will go in first, and then this PR would be the natural one for fixing up proposer-evaluated KV's lease handling (but either way it would have to be compatible with #10327 and thus be rebased on top of that, though it's likely worth holding off for a few days in the hope that my PR merges by then).


Reviewed 26 of 41 files at r1, 2 of 8 files at r2, 1 of 1 files at r3, 12 of 12 files at r4.
Review status: all files reviewed at latest revision, 41 unresolved discussions, some commit checks failed.


pkg/roachpb/data.go, line 976 at r4 (raw file):

// Verify determines whether ol is either the exact same lease,
// or else is an extension of the same expiration-based lease.
func (l Lease) Verify(ol Lease) error {

Verify is a weird name for this.


pkg/storage/client_raft_test.go, line 974 at r4 (raw file):

// TestStoreRangeDownReplicate verifies that the replication queue will notice
// over-replicated ranges and remove replicas from them.
func TestStoreRangeDownReplicate(t *testing.T) {

@tamird: ideally we would stress this test in CI since its contents changed significantly. I don't think that's currently happening, right? Interesting problem, but I think the diff hunks show the method header for each hunk and perhaps you can grep those? Or that's already happening.


pkg/storage/helpers_test.go, line 195 at r4 (raw file):

}

// LeaseStatus exposes replica.leaseStatus for tests.

nit: (*Replica).leaseStatus.


pkg/storage/node_liveness.go, line 149 at r4 (raw file):

              ctx, sp := ambient.AnnotateCtxWithSpan(context.Background(), "heartbeat")
              // Retry heartbeat in the event the conditional put fails.
              for r := retry.StartWithCtx(ctx, base.DefaultRetryOptions()); r.Next(); {

Looks like this context should have a timeout of nl.heartbeatInterval.


pkg/storage/replica.go, line 735 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I had originally run into various complicated problems using a pointer to the lease so I just changed them all for clarity. We could have taken a less drastic step, but this seemed reasonable as leases are small.

`state.Lease` needs to be a pointer for proposer-evaluated KV (well, could hack around it, but I don't see evidence supporting that it should be a special snowflake).

pkg/storage/replica.go, line 785 at r2 (raw file):

the difference between the request timestamp and HLC should be small and if one request doesn't renew the lease, the next will.

Long-running transactions could have a fairly stale timestamp. I don't think we should make this (drive-by?) change since it seems backward (and it is, we used to do it that way and switched), unless there's a good reason.

Comment is generally stale (return value).


pkg/storage/replica.go, line 756 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove the mention of stasis from this comment.

Stasis is still a thing, though.

Why is this a method on Replica? Looks like it should be on *Store. That would also make the variety of different methods that deal with leases more transparent.


pkg/storage/replica.go, line 1768 at r4 (raw file):

  pd.ReplicatedProposalData = storagebase.ReplicatedProposalData{
      RangeID:         r.RangeID,
      ProposerReplica: replica,

Isn't this in the lease and thus this can go?


pkg/storage/replica_proposal.go, line 130 at r4 (raw file):

      return errors.New("conflicting RangeDescriptor")
  }
  if p.State.Lease.Empty() {

Not terribly happy about these kinds of drive-bys. Not being able to distinguish between an empty lease proto and a nil one can lead to problems. Persisting one or the other doesn't give the same byte-wise on-disk state and we should keep a bijection here. Then again, dealing with all the nils is also a pain. But you would definitely need to make sure we never persist anything that would pass Empty().


pkg/storage/replica_range_lease.go, line 283 at r4 (raw file):

//   processed the change in lease holdership).
// * the client fails to read their own write.
func (r *Replica) leaseStatus(lease *roachpb.Lease, timestamp hlc.Timestamp) LeaseStatus {

Why is this on *Replica? It should live on *Store.



andreimatei added a commit to andreimatei/cockroach that referenced this pull request Nov 1, 2016
This has come up in cockroachdb#10305: we have a test that sets the MaxOffset to a
very high value because it will push the clock and it doesn't want the
MaxOffset checks that assert that commands are not coming from the (far)
future to complain. That unfortunately doesn't quite work, as the
MaxOffset is used for the leases' stasis period (which apparently only
10305 starts checking) and so the leases would always be in stasis.

Also moved the MaxOffset to the TestingKnobs, so that people are not
encouraged to use it.

Also cleaned up some MaxOffsets that are no longer needed since we
stopped starting new leases at curTime + MaxOffset.
@andreimatei
Contributor

Review status: all files reviewed at latest revision, 41 unresolved discussions, some commit checks failed.


pkg/sql/txn_restart_test.go, line 1136 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Unless this skip is removed before merge, it should link to an issue number tracking what is necessary to reinstate this test.

I merged #10351 so hopefully this doesn't need to be skipped any more.


@spencerkimball
Member Author

@bdarnell I'm getting started again on this because of course it's rotted to brown pulp. I did the massive merge last night and am working on tests now. I just had some spare time, but my real impetus is for proposer evaluated KV. That feels like it might end up being a crucial step before GA. I'm wondering whether you're actively working on it or not to gauge timing. Clearly I'll need to have this merged and well debugged before we take that next step.

@bdarnell
Contributor

Yeah, if things will just stop being on fire then propEvalKV is next on my plate. I think I'll be starting with the command queue changes (#10413) before I get to anything lease-related, though (#10414). I agree that propEvalKV (or rather the zero-downtime upgrades that it enables) will be a requirement for GA.

@dianasaur323 mentioned this pull request Dec 2, 2016
@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 41 unresolved discussions, some commit checks failed.


pkg/roachpb/data.go, line 976 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Verify is a weird name for this.

Changed to Equivalent


pkg/roachpb/data.proto, line 313 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The reason to rename instead of reserve is that if you find yourself decoding an old protobuf you can still see the old field's name. But when we discussed this the final decision was to just use reserved.

Done.


pkg/sql/txn_restart_test.go, line 1136 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I merged #10351 so hopefully this doesn't need to be skipped any more.

Done.


pkg/storage/below_raft_protos_test.go, line 92 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

That means we'll need to fix freeze-cluster before this can go in.

What's the latest on this? @andreimatei changed this already and now I'm changing it again... and I have to.


pkg/storage/client_raft_test.go, line 974 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

@tamird: ideally we would stress this test in CI since its contents changed significantly. I don't think that's currently happening, right? Interesting problem, but I think the diff hunks show the method header for each hunk and perhaps you can grep those? Or that's already happening.

I've done significant stress tests on existing unittests. I'll leave the rest of this question to @tamird.


pkg/storage/client_raft_test.go, line 1047 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Be sure to run this test (and others where you've made this change) under stress to make sure that the expireLeases call is no longer necessary.

Done. Didn't fail after 1000 runs.


pkg/storage/client_test.go, line 1019 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Shouldn't this be the other way around? If expireLeases also increments epochs, then the only use of this function should be to pass false to it to avoid the increment. Then you could have expireLeases call expireLeasesWithoutIncrementEpoch, which would no longer need an argument.

Good point. Changed.


pkg/storage/helpers_test.go, line 195 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: (*Replica).leaseStatus.

Done.


pkg/storage/node_liveness.go, line 82 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Comment that this atomic.Value contains a bool.

Done.


pkg/storage/node_liveness.go, line 149 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Looks like this context should have a timeout of nl.heartbeatInterval.

Done.


pkg/storage/node_liveness.go, line 181 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Either move this to helpers_test.go or just rename heartbeat to Heartbeat. There's no need to have two methods with the same signature whose names only differ in case in the same scope.

Done.


pkg/storage/node_liveness.go, line 203 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How could that happen? It seems like we'd at least need to compare epochs and not just expirations.

The supplied liveness could race with a concurrent update from the periodic heartbeating thread. I added a comment. Remember that this Heartbeat() method is called from the periodic heartbeater as well as from the lease-acquisition code, in the event that the node which is both the previous leaseholder and the aspiring new leaseholder had its epoch incremented.


pkg/storage/node_liveness.go, line 204 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Thanks to the assignment on the previous line, this condition is always true.

Oops, fixed.


pkg/storage/replica.go, line 735 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

state.Lease needs to be a pointer for proposer-evaluated KV (well, could hack around it, but I don't see evidence supporting that it should be a special snowflake).

Why does it need to be? There's no hacking necessary. Just calling lease.Empty() is equivalent to lease == nil.


pkg/storage/replica.go, line 785 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

the difference between the request timestamp and HLC should be small and if one request doesn't renew the lease, the next will.

Long-running transactions could have a fairly stale timestamp. I don't think we should make this (drive-by?) change since it seems backward (and it is, we used to do it that way and switched), unless there's a good reason.

Comment is generally stale (return value).

OK and cleaned up comment a bit.


pkg/storage/replica.go, line 881 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Why not VInfof(2, ...) ? Also for these calls above.

This is a useful piece of info. Why put it at verbose level 2?


pkg/storage/replica.go, line 2732 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think the commit message should call out that you've fixed the previously possible (I think...) situation: a replica proposes a command under lease A, then quickly transfers A away, then gets another lease B, and then applies the command (the application's check of the lease would succeed).

Done.


pkg/storage/replica.go, line 2817 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: s/lease at/proposed under lease

Done.


pkg/storage/replica.go, line 756 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Stasis is still a thing, though.

Why is this a method on Replica? Looks like it should be on *Store. That would also make the variety of different methods that deal with leases more transparent.

It has to stay here in order to accommodate the minLeaseProposedTS which @andreimatei added.


pkg/storage/replica.go, line 901 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove this.

Done.


pkg/storage/replica.go, line 909 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Add a panic if pErr is nil to make sure we don't reach this point without setting it.

Done.


pkg/storage/replica.go, line 1190 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The old code was checking for a valid lease held by any node; haveLease looks for this node to be the lease holder (which is incorrect here)

Done.


pkg/storage/replica.go, line 1695 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This passes a potentially-bogus lease down through raft. I think we should use a zero-valued (or nil, undoing the pointer-to-value change) lease here and repeat the IsSingleSkipLeaseCheckRequest downstream of raft to decide whether we need to validate the proposal's lease or not.

Good idea. Done.


pkg/storage/replica.go, line 1768 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Isn't this in the lease and thus this can go?

Probably better to remove it, but it's iffy because there are commands without leases and what I must do now is set the Replica field of an empty lease in that case.


pkg/storage/replica.go, line 2818 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

RequestLease and TransferLease also skip the lease check upstream.

They definitely don't skip the lease check. RequestLease verifies that the lease is not valid and TransferLease verifies that it is valid and held. Then, at this point, we verify that the lease status which existed when the command was proposed is still in effect now that we're processing the command. Maybe I'm not understanding your comment.


pkg/storage/replica_proposal.go, line 130 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Not terribly happy about these kinds of drive-bys. Not being able to distinguish between an empty lease proto and a nil one can lead to problems. Persisting one or the other doesn't give the same byte-wise on-disk state and we should keep a bijection here. Then again, dealing with all the nils is also a pain. But you would definitely need to make sure we never persist anything that would pass Empty().

The Replica field of Lease is what determines the Empty() check, and it is always set when a command is sent through Raft. It is further always set when r.mu.state.Lease is persisted. I've added assertions to that effect.


pkg/storage/replica_range_lease.go, line 33 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think this comment needs to be expanded now that this struct handles two types of leases. That's magical.

Done.


pkg/storage/replica_range_lease.go, line 132 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

how about extracting this closure in a top-level function, to keep functions reasonably-sized?

Done.


pkg/storage/replica_range_lease.go, line 136 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think you've created an accessor for Epoch that tells you if it's expiration-based. Use it here?
Also why do you need to check both Epoch and liveness?

Done.


pkg/storage/replica_range_lease.go, line 151 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

as we discussed, you only wanna set this pErr if err was set above.
I'd call "let's add a test" :)

Yes quite right. Will add a test. This code path gets tested plenty by our unittests, btw. It works because the replica gets the not lease holder error and retries.


pkg/storage/replica_range_lease.go, line 150 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We return an error even if Heartbeat or IncrementEpoch succeeded? The flow here is unclear.

No, that was an error. The code still worked in unittests because the replica would just retry on receipt of the NotLeaseHolderError


pkg/storage/replica_range_lease.go, line 230 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Rename this to expiringLeaseLocked.

Done.


pkg/storage/replica_range_lease.go, line 238 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Put leaseError first so that an uninitialized leaseState doesn't default to leaseValid.

Done.


pkg/storage/replica_range_lease.go, line 244 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/least/lease/

Done.


pkg/storage/replica_range_lease.go, line 283 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why is this on *Replica? It should live on *Store.

It can't. We need the replica's minLeaseProposedTS


pkg/storage/replica_state.go, line 142 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe replace this with a lease.Empty() check?

Done.



@spencerkimball force-pushed the spencerkimball/epoch-range-leases branch 2 times, most recently from ba8bf9f to 69f8a58 on December 3, 2016 19:28
@tamird
Contributor

tamird commented Dec 3, 2016

Reviewed 1 of 41 files at r1, 25 of 49 files at r5.
Review status: 25 of 49 files reviewed at latest revision, 46 unresolved discussions, some commit checks pending.


pkg/storage/client_raft_test.go, line 913 at r5 (raw file):

		{DisableRefreshReasonNewLeader: true, DisableRefreshReasonSnapshotApplied: true},
	}
	for tcIdx, c := range testCases {

use a subtest instead - this test already wraps itself in a function, so the change will be trivial.


pkg/storage/client_raft_test.go, line 973 at r5 (raw file):

			for i := 0; i < 2; i++ {
				if err := mtc.stores[i].DrainLeases(true); err != nil {
					t.Fatalf("test case %d, store %d: %v", tcIdx, i, err)

revert this when you go to a subtest here.


pkg/storage/replica_state.go, line 142 at r3 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

Haven't the semantics here changed, now? It seems odd for this function to swallow empty leases, and this makes it quite difficult to see which callers of this function may encounter this scenario.


pkg/storage/stores.go, line 84 at r5 (raw file):

func (ls *Stores) GetStore(storeID roachpb.StoreID) (*Store, error) {
	ls.mu.RLock()
	defer ls.mu.RUnlock()

why change this?


pkg/storage/storagebase/proposer_kv.proto, line 119 at r5 (raw file):

  // If the command doesn't require a lease, origin_lease will be empty,
  // with the exception of the Replica field.
  optional roachpb.Lease origin_lease = 5 [(gogoproto.nullable) = false];

why rely on empty here? seems like you want it to be nil when a lease is not required.


pkg/storage/storagebase/state.proto, line 53 at r5 (raw file):

  roachpb.RangeDescriptor desc = 3;
  // The latest range lease.
  roachpb.Lease lease = 4 [(gogoproto.nullable) = false];

isn't this what's causing the below-raft protos check to fail? why do you need to change this nullability? it seems like you've just swapped it for an Empty() method.



@tamird
Contributor

tamird commented Dec 5, 2016

Reviewed 15 of 43 files at r6.
Review status: 18 of 46 files reviewed at latest revision, 43 unresolved discussions, some commit checks failed.


pkg/storage/below_raft_protos_test.go, line 92 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

OK, I'm now avoiding the necessity of using freeze-cluster. So we never used freeze-cluster? Sounds like a lesson is in there somewhere.

So this still changed?


pkg/storage/client_raft_test.go, line 920 at r6 (raw file):

	}
	for tcIdx, c := range testCases {
		t.Run(fmt.Sprintf("test case %d", tcIdx), func(t *testing.T) {

you can call Run("", ... and the test runner will automatically use an incrementing integer index. See 0103e37
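A minimal sketch of that pattern (a made-up test, not one from this PR): with an empty subtest name, the test runner names the subtests "#00", "#01", ..., so there is no need for fmt.Sprintf("test case %d", tcIdx).

package storage

import "testing"

// TestLeaseCases is a hypothetical example of the suggested style.
func TestLeaseCases(t *testing.T) {
	testCases := []struct {
		disableRefreshReasonNewLeader bool
	}{
		{false},
		{true},
	}
	for _, c := range testCases {
		// Empty name: the runner reports TestLeaseCases/#00, TestLeaseCases/#01, ...
		t.Run("", func(t *testing.T) {
			if c.disableRefreshReasonNewLeader {
				t.Log("refresh on new leader disabled")
			}
		})
	}
}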



@spencerkimball force-pushed the spencerkimball/epoch-range-leases branch from 29cb6a8 to c7e137d on December 5, 2016 19:40
@spencerkimball
Member Author

Review status: 17 of 45 files reviewed at latest revision, 39 unresolved discussions, some commit checks failed.


pkg/storage/below_raft_protos_test.go, line 92 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

So this still changed?

Well, we now have an Epoch. When it's empty, it's the same as if the field were nullable. However, when it's set to something, we obviously have a different checksum.


pkg/storage/client_raft_test.go, line 920 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

you can call Run("", ... and the test runner will automatically use an incrementing integer index. See 0103e37

Done.



@spencerkimball force-pushed the spencerkimball/epoch-range-leases branch 2 times, most recently from 65d8b7e to 31ad27d on December 5, 2016 22:39
@andreimatei
Contributor

Reviewed 1 of 8 files at r2.
Review status: 13 of 45 files reviewed at latest revision, 52 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 881 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

This is a useful piece of info. Why put it at verbose level 2?

VInfof is more visible than Eventf. The latter only goes to the trace. The former goes to trace and, optionally, to the log.


pkg/storage/replica.go, line 1694 at r7 (raw file):

	// Admin commands always require the range lease.
	status, pErr := r.redirectOnOrAcquireLease(ctx)

I suggest reverting this change and not passing status to AdminTransferLease


pkg/storage/replica.go, line 2173 at r7 (raw file):

//   which case the other returned values are zero.
func (r *Replica) propose(
	ctx context.Context, lease *roachpb.Lease, ba roachpb.BatchRequest, endCmds *endCmds,

please document the new param


pkg/storage/replica.go, line 3213 at r7 (raw file):

	verifyLease := func() error {
		if raftCmd.BatchRequest.IsSingleSkipLeaseCheckRequest() ||

IsSingleSkipLeaseCheckRequest refers to proposal-time lease checking. I don't think it's OK to use it here; you don't want to skip any apply-side checks for them


pkg/storage/replica_range_lease.go, line 43 at r7 (raw file):

// clock offset. Epoch-based leases do not expire, but rely on the
// leaseholder maintaining its node liveness record (also a lease, but
// at the node level).  All ranges up to and including the node

nit: double space


pkg/storage/replica_range_lease.go, line 45 at r7 (raw file):

// at the node level).  All ranges up to and including the node
// liveness table must use expiration-based leases to avoid any
// circular dependencies. Note that a range can switch from expiration

the last sentence scares the reader. Also the implementation of requiresExpiringLeaseLocked doesn't seem to support "switching" of any kind. Why did you feel the need to put this note here?


pkg/storage/replica_range_lease.go, line 153 at r7 (raw file):

		return llChan
	}
	p.llChans = append(p.llChans, llChan)

it's a bit subtle why it's correct to append to llChans after calling requestLeaseAsync - because InitJoinReq has to be called under r.mu, as documented on the struct comment, and the async method locks that before signaling completion. It'd be nice to add a comment, or move the appending above and handle the error case somehow.


pkg/storage/replica_range_lease.go, line 172 at r7 (raw file):

		// If epoch-based lease & state is expired, increment epoch or heartbeat.
		if reqLease.Type() == roachpb.LeaseEpoch && status.liveness != nil && status.state == leaseExpired {

what does it mean for status.liveness to be nil? The docs on LeaseStatus don't suggest that as a possibility. Can we get rid of the check? Also, can we make liveness not be a pointer?


pkg/storage/replica_range_lease.go, line 288 at r7 (raw file):

// epoch-based.
type LeaseStatus struct {
	state     leaseState    // state of the lease

// state of the lease in relationship to timestamp


pkg/storage/replica_range_lease.go, line 289 at r7 (raw file):

type LeaseStatus struct {
	state     leaseState    // state of the lease
	timestamp hlc.Timestamp // timestamp the lease was evaluated at

mmm this pattern seems weird to me... You're tying a timestamp to a Lease, and a LeaseState. I can see how that's sometimes convenient, but isn't it confusing that a bunch of methods that only used to need a timestamp now get this weird struct?
Was there a particular reason you've introduced this bundling?


pkg/storage/replica_range_lease.go, line 295 at r7 (raw file):

// GetLeaseStatus gets the lease status for the replica's currently
// held lease.

... in relationship with timestamp.


pkg/storage/replica_range_lease.go, line 352 at r7 (raw file):

		var err error
		status.liveness, err = r.store.cfg.NodeLiveness.GetLiveness(lease.Replica.NodeID)
		if err != nil || status.liveness.Epoch < *lease.Epoch {

Can status.liveness.Epoch < *lease.Epoch happen? If not, turn it into a panic instead of bundling it into this condition?


pkg/storage/replica_range_lease.go, line 431 at r7 (raw file):

// this method joins in waiting for it to complete if it's transferring to the
// same replica. Otherwise, a NotLeaseHolderError is returned.
func (r *Replica) AdminTransferLease(target roachpb.StoreID, status LeaseStatus) error {

the new param should be documented.

But I don't think it should have been introduced at all. Why not let AdminTransferLease read the lease for itself?


pkg/storage/replica_range_lease.go, line 269 at r8 (raw file):

//
// The replica lock must be held.
func (r *Replica) requiresExpiringLeaseLocked() bool {

instead of this method, can we have a simple member boolean initialized to the constant answer?


pkg/storage/replica_range_lease.go, line 270 at r8 (raw file):

// The replica lock must be held.
func (r *Replica) requiresExpiringLeaseLocked() bool {
	return r.store.cfg.NodeLiveness == nil || !enableEpochLeases ||

what does it mean for cfg.NodeLiveness to be nil?



@spencerkimball
Member Author

Review status: 13 of 45 files reviewed at latest revision, 52 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 881 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

VInfof is more visible than Eventf. The latter only goes to the trace. The former goes to trace and, optionally, to the log.

OK, but I'd like to just make the point that this stuff has become a mess. There should be only one way to do things, for good or bad.


pkg/storage/replica.go, line 1694 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I suggest reverting this change and not passing status to AdminTransferLease

See later comment. redirectOnOrAcquireLease does the verification now, NOT processRaftCommand post-Raft. If we fetched the lease in AdminTransferLease, we could potentially use someone else's lease, or an invalid lease, and do so successfully! Remember that post-Raft, what matters is that the lease matches the proposer-verified, and supplied, lease.


pkg/storage/replica.go, line 2173 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please document the new param

Done.


pkg/storage/replica.go, line 3213 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

IsSingleSkipLeaseCheckRequest refers to proposal-time lease checking. I don't think it's OK to use it here; you don't want to skip any apply-side checks for them

There is no lease specified if that flag is true on the batch request (i.e. the lease is checked neither pre-Raft nor post-Raft). Added a comment.


pkg/storage/replica_range_lease.go, line 43 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: double space

Done.


pkg/storage/replica_range_lease.go, line 45 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

the last sentence scares the reader. Also the implementation of requiresExpiringLeaseLocked doesn't seem to support "switching" of any kind. Why did you feel the need to put this note here?

The reader is easily scared. requiresExpiringLeaseLocked certainly does support switching: you simply change the value of enableEpochRangeLeases. This is necessary to support @petermattis' request that we be able to enable and disable this in prod. There is now a unittest for this as well.


pkg/storage/replica_range_lease.go, line 153 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

it's a bit subtle why it's correct to append to llChans after calling requestLeaseAsync - because InitJoinReq has to be called under r.mu, as documented on the struct comment, and the async method locks that before signaling completion. It'd be nice to add a comment, or move the appending above and handle the error case somehow.

I'm going to leave a TODO here for you.


pkg/storage/replica_range_lease.go, line 172 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what does it mean for status.liveness to be nil? The docs on LeaseStatus don't suggest that as a possibility. Can we get rid of the check? Also, can we make liveness not be a pointer?

Removed the check, though I'm going to keep liveness a pointer on LeaseStatus because the NodeLiveness object traffics in *Liveness and simply copying here to a non-pointer struct feels like it could yield future fragility for unclear benefit.


pkg/storage/replica_range_lease.go, line 288 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

// state of the lease in relationship to timestamp

Done.


pkg/storage/replica_range_lease.go, line 289 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

mmm this pattern seems weird to me... You're tying a timestamp to a Lease, and a LeaseState. I can see how that's sometimes convenient, but isn't it confusing that a bunch of methods that only used to need a timestamp now get this weird struct?
Was there a particular reason you've introduced this bundling?

They're all necessary through the stack at one place or another. Obviously we can't use the timestamp anymore. We need the lease primarily, but it turns out we also need the timestamp, and of course we need the state (although we could probably trim that one and return two variables). The liveness is necessary to increment the epoch if required when we request the lease. This is the bundle of stuff that's necessary, like it or not. Returning a bunch of values sounds worse to me.


pkg/storage/replica_range_lease.go, line 295 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

... in relationship with timestamp.

Done.


pkg/storage/replica_range_lease.go, line 352 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Can status.liveness.Epoch < *lease.Epochhappen? If not, turn it into a panic instead of bundling it in this condition?

Yes. When evaluating a lease held by another replica, it's possible that gossip is not as up-to-date as Raft's propagation of the actual lease. It's not such a big deal as you really only care about this for the owner of the lease and the owner of the lease is not dependent on gossip. However, we obviously don't want nodes panicking when checking if another node's lease is valid (say for metrics).


pkg/storage/replica_range_lease.go, line 431 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

the new param should be documented.

But I don't think it should have been introduced at all. Why not let AdminTransferLease read the lease for itself?

Because then it would really need to verify it again. That's the job of Replica.redirectOnOrAcquireLease.


pkg/storage/replica_range_lease.go, line 269 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of this method, can we have a simple member boolean initialized to the constant answer?

I really hate adding new member variables to Replica.


pkg/storage/replica_range_lease.go, line 270 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what does it mean for cfg.NoveLiveness to be nil?

Unittests.



@spencerkimball force-pushed the spencerkimball/epoch-range-leases branch 2 times, most recently from 87a50dd to 7952470 on December 6, 2016 06:32
@spencerkimball
Member Author

Review status: 13 of 46 files reviewed at latest revision, 37 unresolved discussions, some commit checks pending.


pkg/storage/replica_range_lease.go, line 172 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Removed the check, though I'm going to keep liveness a pointer on LeaseStatus because the NodeLiveness object traffics in *Liveness and simply copying here to a non-pointer struct feels like it could yield future fragility for unclear benefit.

Actually, this is a very important predicate. When I took it out all hell broke loose. This is a subtle predicate. What you have to keep in mind is that the status of the current lease has already been determined to need updating. What this is meant to do is check whether that prior lease was epoch-based and is now expired. If we're also requesting an epoch-based lease now, then we either need to heartbeat our own liveness record or else increment the prior owner's epoch. I think what was confusing was using status.liveness != nil. I've changed that to status.lease.Type() == LeaseEpoch.
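In sketch form, one plausible reading of that branch (the types, field names, and heartbeat/increment helpers below are simplified stand-ins for the real Replica and NodeLiveness code):

package main

import "fmt"

type LeaseType int

const (
	LeaseExpiration LeaseType = iota
	LeaseEpoch
)

type leaseState int

const (
	leaseError leaseState = iota
	leaseValid
	leaseExpired
)

// Simplified stand-ins for the real Lease and LeaseStatus types.
type Lease struct {
	OwnerNodeID int
	Type        LeaseType
}

type LeaseStatus struct {
	state leaseState
	lease Lease
}

// prepareEpochLease sketches the branch in question: if the prior (expired)
// lease was epoch-based and we are requesting an epoch-based lease, either
// heartbeat our own liveness record (we were the holder) or increment the
// prior holder's epoch (someone else was) before the lease can be acquired.
func prepareEpochLease(
	reqLease Lease, status LeaseStatus, selfNodeID int,
	heartbeatSelf func() error, incrementEpoch func(nodeID int) error,
) error {
	if reqLease.Type == LeaseEpoch && status.lease.Type == LeaseEpoch && status.state == leaseExpired {
		if status.lease.OwnerNodeID == selfNodeID {
			return heartbeatSelf()
		}
		return incrementEpoch(status.lease.OwnerNodeID)
	}
	return nil
}

func main() {
	status := LeaseStatus{state: leaseExpired, lease: Lease{OwnerNodeID: 2, Type: LeaseEpoch}}
	req := Lease{OwnerNodeID: 1, Type: LeaseEpoch}
	err := prepareEpochLease(req, status, 1,
		func() error { fmt.Println("heartbeat own liveness record"); return nil },
		func(n int) error { fmt.Printf("increment epoch of node %d\n", n); return nil },
	)
	fmt.Println(err)
}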



@spencerkimball force-pushed the spencerkimball/epoch-range-leases branch from 7952470 to 97f1e4c on December 6, 2016 07:16
@bdarnell
Contributor

bdarnell commented Dec 6, 2016

:lgtm:

I've done another pass over this monster to make sure it's safe without freeze-cluster; it looks good except for my comment about OriginReplica in replica.go


Reviewed 33 of 43 files at r6, 4 of 9 files at r8, 7 of 7 files at r9.
Review status: 45 of 46 files reviewed at latest revision, 45 unresolved discussions, some commit checks failed.


pkg/roachpb/data.proto, line 313 at r9 (raw file):

  optional ReplicaDescriptor replica = 3 [(gogoproto.nullable) = false];

  // The start of the lease stasis period. This field deprecated.

s/field/field is/


pkg/storage/below_raft_protos_test.go, line 92 at r1 (raw file):

Sounds like a lesson is in there somewhere.

Yeah, but the lesson is a bit ambiguous. A half-finished freeze-cluster implementation was a waste of time in retrospect, but if we had finished it and used it routinely we could have avoided the bugs around the 20161110 release (and probably simplified some other transitions too). I had assumed from the beginning that propEvalKV would require freeze-cluster because we'd be unable to support a feature-flagged transition, but Tobi made that work.


pkg/storage/replica.go, line 843 at r9 (raw file):

// haveLease returns whether this replica is the current valid leaseholder.
func (r *Replica) haveLease(ts hlc.Timestamp) bool {

s/have/has/


pkg/storage/replica.go, line 3220 at r9 (raw file):

			return nil
		}
		if raftCmd.OriginLease == nil || r.mu.state.Lease == nil {

OriginLease will be nil for log entries proposed before this change; we must validate them using the origin replica and timestamp instead.
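A sketch of that apply-time check with the suggested fallback (simplified stand-in types; the real logic is the verifyLease closure in replica.go discussed above, and the fallback shown here is only an illustration of the described approach):

package main

import (
	"errors"
	"fmt"
)

type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

type ReplicaDescriptor struct{ ReplicaID int }

type Lease struct {
	Replica    ReplicaDescriptor
	Expiration Timestamp
}

// verifyLease sketches the apply-time check: commands that carry the
// proposer's lease are checked against the currently applied lease; entries
// proposed before this change carry no lease, so the fallback checks that the
// proposing replica still holds an unexpired expiration-based lease at the
// command's timestamp.
func verifyLease(originLease *Lease, originReplica ReplicaDescriptor, ts Timestamp, applied Lease) error {
	if originLease == nil {
		// Pre-change entry: validate via origin replica and timestamp.
		if applied.Replica == originReplica && ts.Less(applied.Expiration) {
			return nil
		}
		return errors.New("proposing replica no longer holds a valid lease")
	}
	if *originLease == applied {
		return nil
	}
	return errors.New("lease changed between proposal and application")
}

func main() {
	applied := Lease{Replica: ReplicaDescriptor{ReplicaID: 1}, Expiration: Timestamp{WallTime: 200}}
	fmt.Println(verifyLease(nil, ReplicaDescriptor{ReplicaID: 1}, Timestamp{WallTime: 100}, applied)) // <nil>
	stale := Lease{Replica: ReplicaDescriptor{ReplicaID: 2}, Expiration: Timestamp{WallTime: 150}}
	fmt.Println(verifyLease(&stale, ReplicaDescriptor{ReplicaID: 2}, Timestamp{WallTime: 100}, applied)) // error
}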


pkg/storage/replica.go, line 3222 at r9 (raw file):

		if raftCmd.OriginLease == nil || r.mu.state.Lease == nil {
			if raftCmd.OriginLease == r.mu.state.Lease {
				return nil

When would this happen? (r.mu.state.Lease is nil but it's not a SkipLease command and we want to proceed)


pkg/storage/replica.go, line 3248 at r9 (raw file):

		forcedErr = roachpb.NewErrorf("no-op on empty Raft entry")
	} else if err := verifyLease(); err != nil {
		// Verify the lease matches the proposer's expectation. We rely on

Move this comment to the definition of verifyLease.


pkg/storage/replica_command.go, line 1638 at r9 (raw file):

	// MIGRATION(tschottdorf): needed to apply Raft commands which got proposed
	// before the StartStasis field was introduced.
	if args.Lease.StartStasis.Equal(hlc.ZeroTimestamp) {

We need to keep this in to avoid a freeze-cluster. (OK, so it's a long shot that any log entries this old will still be around, but still. We'll do a big purge of this kind of stuff after propEvalKV is in)


pkg/storage/store.go, line 830 at r9 (raw file):

	}

	if !cfg.EnableEpochRangeLeases {

This pattern means that it's impossible for StoreConfig to set EnableEpochRangeLeases to false when the env var (or default) is true. That didn't matter much for EnableCoalescedHeartbeats, but it's going to break the new "switcheroo" test you added when we flip the default.
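The underlying issue is the usual "false is indistinguishable from unset" trap for boolean config fields. A small sketch of the problematic pattern and of one of the workarounds floated later in this thread (a *bool so that nil means unset); the field and helper names are illustrative:

package main

import "fmt"

// Problematic pattern: a false field is indistinguishable from "not set",
// so the env/default value always wins over an explicit false.
type storeConfigBool struct {
	EnableEpochRangeLeases bool
}

func (c *storeConfigBool) setDefaults(envDefault bool) {
	if !c.EnableEpochRangeLeases {
		c.EnableEpochRangeLeases = envDefault
	}
}

// One workaround: a pointer, so nil means "unset" and an explicit false survives.
type storeConfigPtr struct {
	EnableEpochRangeLeases *bool
}

func (c *storeConfigPtr) setDefaults(envDefault bool) {
	if c.EnableEpochRangeLeases == nil {
		c.EnableEpochRangeLeases = &envDefault
	}
}

func main() {
	b := storeConfigBool{EnableEpochRangeLeases: false} // explicit false
	b.setDefaults(true)
	fmt.Println(b.EnableEpochRangeLeases) // true: the explicit false was lost

	off := false
	p := storeConfigPtr{EnableEpochRangeLeases: &off}
	p.setDefaults(true)
	fmt.Println(*p.EnableEpochRangeLeases) // false: preserved
}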


pkg/storage/storagebase/proposer_kv.proto, line 121 at r9 (raw file):

  // origin_lease is provided to verify at raft command apply-time that
  // the lease under which the command was proposed remains in effect.
  // If the command doesn't require a lease, origin_lease will be nil.

Add "If the command was proposed prior to the introduction of epoch leases, origin_lease will be nil, but the combination of origin_replica and timestamp are sufficient to verify an expiration-based lease."



@bdarnell
Contributor

bdarnell commented Dec 6, 2016

Review status: 45 of 46 files reviewed at latest revision, 45 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 3213 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

There is no lease specified if that flag is true on the batch request (i.e. the lease is checked neither pre-Raft nor post-Raft). Added a comment.

This is what I was just referring to in the meeting. IsSingleSkipLeaseCheckRequest is true for RequestLease and TransferLease, so in propEvalKV, if the proposer had stale information, it could propose an improper lease which would be blindly applied by all followers. That's not new with this PR; it's tracked separately in #10414.


Comments from Reviewable

@spencerkimball
Copy link
Member Author

Review status: 45 of 46 files reviewed at latest revision, 45 unresolved discussions, some commit checks failed.


pkg/roachpb/data.proto, line 313 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/field/field is/

Done.


pkg/storage/replica.go, line 3213 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This is what I was just referring to in the meeting. IsSingleSkipLeaseCheckRequest is true for RequestLease and TransferLease, so in propEvalKV, if the proposer had stale information, it could propose an improper lease which would be blindly applied by all followers. That's not new with this PR; it's tracked separately in #10414.

OK, I was missing something there.


pkg/storage/replica.go, line 843 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/have/has/

Done.


pkg/storage/replica.go, line 3220 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

OriginLease will be nil for log entries proposed before this change; we must validate them using the origin replica and timestamp instead.

Done.


pkg/storage/replica.go, line 3222 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

When would this happen? (r.mu.state.Lease is nil but it's not a SkipLease command and we want to proceed)

This is no longer necessary with the changes to restore the previous checks for non-epoch-based commands.


pkg/storage/replica.go, line 3248 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Move this comment to the definition of verifyLease.

Done.


pkg/storage/replica_command.go, line 1638 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We need to keep this in to avoid a freeze-cluster. (OK, so it's a long shot that any log entries this old will still be around, but still. We'll do a big purge of this kind of stuff after propEvalKV is in)

Done.


pkg/storage/store.go, line 830 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This pattern means that it's impossible for StoreConfig to set EnableEpochRangeLeases to false when the env var (or default) is true. That didn't matter much for EnableCoalescedHeartbeats but it's going to break the new "switcheroo" test you added when we flip the default.

Any suggestions? I considered an enum, and creating a *bool, but in the end it seemed simpler just to look up the value from the env in server/server.go. I assume we're planning to remove this setting entirely at some point, right?
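For what it's worth, a rough sketch of the `*bool` idea (purely illustrative; the method name and the env-var parsing here are assumptions, not the real code): a nil pointer means "defer to the environment variable or the default", while an explicit value wins either way, so a test can force the flag off even when the default flips to true.

```go
package main

import (
	"fmt"
	"os"
)

// StoreConfig with a three-state override: nil means "use the
// COCKROACH_ENABLE_EPOCH_LEASES env var (or the compiled-in default)";
// a non-nil value wins unconditionally, even when it is false.
type StoreConfig struct {
	EnableEpochRangeLeases *bool
}

func (cfg StoreConfig) epochLeasesEnabled(defaultValue bool) bool {
	if cfg.EnableEpochRangeLeases != nil {
		return *cfg.EnableEpochRangeLeases
	}
	if v, ok := os.LookupEnv("COCKROACH_ENABLE_EPOCH_LEASES"); ok {
		return v == "true" || v == "1" // simplified parsing
	}
	return defaultValue
}

func main() {
	off := false
	fmt.Println(StoreConfig{}.epochLeasesEnabled(true))                             // env var or default
	fmt.Println(StoreConfig{EnableEpochRangeLeases: &off}.epochLeasesEnabled(true)) // false, even if the default is true
}
```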


pkg/storage/storagebase/proposer_kv.proto, line 121 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Add "If the command was proposed prior to the introduction of epoch leases, origin_lease will be nil, but the combination of origin_replica and timestamp are sufficient to verify an expiration-based lease."

Done.


Comments from Reviewable

@andreimatei
Copy link
Contributor

:lgtm:


Review status: 35 of 46 files reviewed at latest revision, 31 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 3213 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

OK, I was missing something there.

Ack, but I think the TODO should be more generic, explaining that we need to verify some lease, not call for removal of isSkipLease. isSkipLease is required on the proposer side because of the technicalities of how redirectOnOrAcquire works - RequestLease and LeaseTransferRequest can't use it.


pkg/storage/replica_range_lease.go, line 45 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

The reader is easily scared. requiresExpiringLeaseLocked certainly does support switching. You simply change the value of enableEpochRangeLeases. This is necessary to support @petermattis' request that we be able to enable and disable this in prod. There is now a unit test for this as well.

OK, but then I think you should hint that the "switching" happens between server restarts, otherwise it sounds like it's a more dynamic decision.


pkg/storage/replica_range_lease.go, line 295 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

nit: I find the comment confusing. "currently held lease at the specified ts" seems to be contradictory.
I think "replica's current lease in relationship with timestamp" seems more suggestive to me.


pkg/storage/replica_range_lease.go, line 269 at r8 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I really hate adding new member variables to Replica.

well, but if it is a constant...


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/epoch-range-leases branch from 4a9f024 to c620d4a Compare December 7, 2016 17:51
@spencerkimball
Copy link
Member Author

Review status: 35 of 46 files reviewed at latest revision, 31 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 3213 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Ack, but I think the TODO should be more generic, explaining that we need to verify some lease, not call for removal of isSkipLease. isSkipLease is required on the proposer side because of the technicalities of how redirectOnOrAcquire works - RequestLease and LeaseTransferRequest can't use it.

Done.


pkg/storage/replica_range_lease.go, line 45 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

OK, but then I think you should hint that the "switching" happens between server restarts, otherwise it sounds like it's a more dynamic decision.

Done.


pkg/storage/replica_range_lease.go, line 295 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: I find the comment confusing. "currently held lease at the specified ts" seems to be contradictory.
I think "replica's current lease in relationship with timestamp" seems more suggestive to me.

Done.


pkg/storage/replica_range_lease.go, line 269 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

well, but if it is a constant...

OK, done.


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/epoch-range-leases branch from c620d4a to eb4ccef Compare December 7, 2016 17:53
@bdarnell
Copy link
Contributor

bdarnell commented Dec 8, 2016

Reviewed 5 of 11 files at r10, 13 of 13 files at r11.
Review status: all files reviewed at latest revision, 24 unresolved discussions, all commit checks successful.


pkg/storage/store.go, line 830 at r9 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Any suggestions? I considered an enum, and creating a *bool, but in the end it seemed simpler just to look up the value from the env in server/server.go. I assume we're planning to remove this setting entirely at some point, right?

Yeah, we'll eventually remove it. I don't have any good suggestions; all the alternatives seem ugly.


Comments from Reviewable

Introduce new epoch-based range leases. These are designed to use
the same machinery as the expiration-based leases but use epochs from
the node liveness table instead of expirations.

The same Lease protobuf is utilized for both types of leases, but
there's now an optional Epoch. Previously, the lease proto had a "stasis"
timestamp that's now removed and replaced by logic in the replica to
evaluate the state of a lease.

In order to evaluate whether a lease is valid at command apply time
(downstream of Raft), we evaluate the lease upstream of Raft and send
it with every Raft command to be compared to the lease at apply time.

See: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/range_leases.md

Epoch-based leases can be enabled or disabled with the
`COCKROACH_ENABLE_EPOCH_LEASES` environment variable.

This change fixes a previously possible loophole in lease verification
for expiration-based leases. In this scenario, a node could propose a
command with an older lease, transfer it away, receive a new lease,
and then execute the command.
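
As a rough illustration of the two lease flavors the commit message describes (a simplified sketch, not the code in this PR; the type and field names are assumptions): an expiration-based lease is checked against its expiration timestamp, while an epoch-based lease is checked against the owner node's current liveness epoch.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified lease: Epoch == 0 denotes an expiration-based lease.
type Lease struct {
	Epoch      int64
	Expiration time.Time
}

// valid sketches the two validity rules: an expiration-based lease holds
// while ts precedes its expiration; an epoch-based lease holds while the
// owner's node-liveness epoch still matches the epoch stored in the lease.
func valid(l Lease, ts time.Time, ownerLivenessEpoch int64) bool {
	if l.Epoch != 0 {
		return l.Epoch == ownerLivenessEpoch
	}
	return ts.Before(l.Expiration)
}

func main() {
	now := time.Now()
	exp := Lease{Expiration: now.Add(9 * time.Second)}
	ep := Lease{Epoch: 3}
	fmt.Println(valid(exp, now, 0)) // true: before expiration
	fmt.Println(valid(ep, now, 3))  // true: epochs match
	fmt.Println(valid(ep, now, 4))  // false: owner's epoch advanced (e.g. node restarted)
}
```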
@spencerkimball spencerkimball force-pushed the spencerkimball/epoch-range-leases branch from eb4ccef to 4c2b798 Compare December 8, 2016 18:16
@spencerkimball spencerkimball merged commit 2b9df0d into master Dec 8, 2016
@spencerkimball spencerkimball deleted the spencerkimball/epoch-range-leases branch December 8, 2016 18:32
benesch added a commit to benesch/cockroach that referenced this pull request Oct 18, 2018
If a store receives a snapshot that overlaps an existing replica, we
take it as a sign that the local replica may no longer be a member of
its range and queue it for processing in the replica GC queue.

When this code was added (cockroachdb#10426), the replica GC queue was quite
aggressive about processing replicas, and so the implementation was
careful to only queue a replica if it looked "inactive." Unfortunately,
this inactivity check rotted when epoch-based leases were introduced a
month later (cockroachdb#10305). An inactive replica with an epoch-based lease can
incorrectly be considered active, even if all of the other members of
the range have stopped sending it messages, because the epoch-based
lease will continue to be heartbeated by the node itself. (With an
expiration-based lease, the replica's local copy of the lease would
quickly expire if the other members of the range stopped sending it
messages.)

Fixing the inactivity check to work with epoch-based leases is rather
tricky. Quiescent replicas are nearly indistinguishable from
abandoned replicas. This commit just removes the inactivity check and
unconditionally queues replicas for GC if they intersect an incoming
snapshot. The replica GC queue check is relatively cheap (one or two
meta2 lookups), and overlapping snapshot situations are not expected to
last for very long.

Release note: None
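
A tiny sketch of the behavior change that commit message describes (illustrative only; the function and type names are made up for the example): rather than gating on an inactivity heuristic, every existing replica whose span intersects the incoming snapshot is queued for replica GC.

```go
package main

import "fmt"

// Span is a simplified key range, [Start, End).
type Span struct{ Start, End string }

func (a Span) overlaps(b Span) bool { return a.Start < b.End && b.Start < a.End }

// queueOverlappingForGC returns the replicas whose spans intersect the
// incoming snapshot's span; in the scheme described above they are queued
// for replica GC unconditionally, with no inactivity check.
func queueOverlappingForGC(existing []Span, snapshot Span) []Span {
	var queued []Span
	for _, r := range existing {
		if r.overlaps(snapshot) {
			queued = append(queued, r)
		}
	}
	return queued
}

func main() {
	existing := []Span{{"a", "c"}, {"c", "f"}, {"x", "z"}}
	fmt.Println(queueOverlappingForGC(existing, Span{"b", "d"})) // [{a c} {c f}]
}
```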
Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting