Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stability: implement node liveness; first step towards new range leases #9530

Merged
merged 1 commit into from
Oct 12, 2016

Conversation

spencerkimball
Copy link
Member

@spencerkimball spencerkimball commented Sep 25, 2016

This change adds a node liveness table as a global system table.
Nodes periodically write updates to their liveness record by doing
a conditional put to the liveness table. The leader of the range
containing the node liveness table gossips the latest information
to the rest of the system.

Each node has a NodeLiveness object which can be used to query
the status of any other node to find out if it's live or non-live
according to the liveness threshold duration compared to the last
time it successfully heartbeat its liveness record.

The as-yet-unused IncrementEpoch mechanism is also added in this
PR, for eventual use with the planned epoch-based range leader leases.

Updated the range leader lease RFC to reflect current thinking.


This change is Reviewable

@@ -353,7 +344,8 @@ func (ds *DistSender) sendRPC(
ba roachpb.BatchRequest,
) (*roachpb.BatchResponse, error) {
if len(replicas) == 0 {
return nil, noNodeAddrsAvailError{}
return nil, roachpb.NewSendError(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change allows the system to properly retry such errors instead of failing the request to the distributed sender.

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from 01f8b9e to 6467402 Compare September 25, 2016 21:26
Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks like a good change to me, although I still don't think we should include this under the code yellow. To my mind it's out of scope for the stability crisis so we should either end the code yellow first or revise our policies to make them more sustainable for a code yellow that's going to go on for a longer time (i.e. in either case we should merge the master and develop branches before this change goes in).

node lease record to ensure that the epoch counter is consistent and
the start time is greater than the prior range lease holder’s node
lease expiration (plus the maximum clock offset).
and a node heartbeat timestamp.
Copy link
Contributor

@bdarnell bdarnell Sep 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing expiration timestamps rather than last-heartbeat timestamps is generally better because it makes it possible to change the heartbeat interval and lease timeout. (The expiration timestamp is a promise by that node not to serve any commands after expiration, so it relies only on that node's timing parameters. If each node independently computes the expiration based on the last heartbeat, then all nodes must agree about the lease timeout and it becomes very difficult to ever change).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm amending this to including a duration as well as the last heartbeat.

liveness table. They do not consult their outlook on the liveness
table and can even be disconnected from gossip.

[NB: previously this RFC recommended a distributed transaction to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFC template has an "alternatives" section for this kind of thing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't rise to the level of an alternative to the proposed design. I just wanted to provide a complete description of why a distributed txn with the liveness record isn't necessary for posterity. I've moved the bulk of this comment down into a subsection of Alternatives and added a nota bene link.

an HLC clock time corresponding to the now-old epoch at which it
acquired the lease.]

If gossip cannot be reached by a node, then it will lose the ability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does gossip matter? Isn't it enough that the node was able to execute its conditional put to increase its timestamp?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually doesn't. As long as the node can heartbeat, it's set. I removed this paragraph.

range lease is being removed. `AdminTransferLease` will be enhanced to
perform transfers correctly using node lease style range leases.

Nodes which propose or transfer an epoch-based leader lease must
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughout: we call it the "range lease", not the "leader lease" now, to avoid confusion with raft leadership.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Changed in the code now too.

3| 3| 1| -
4| 2| 19| -

On write to range 4, leader lease is invalid. To get a new lease, a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this first sentence mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this section now that I added the liveness table section at Vivek's request.

verifyLiveness(t, mtc)
mtc.stopStore(0)

// Create a new gossip instance and connect it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you just make two stores in the multiTestContext instead of creating a second gossip instance by hand?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten to avoid use of a new gossip and also uses two nodes to verify we're sending more than just the first.

// must hold the lease to a range which contains some or all of the
// node liveness records. After scanning the records, it checks
// against what's already in gossip and only gossips records which
// are out of date.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be simpler to have gossip do this check itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gossip.AddInfo could check and see if the new value is already present, and if so do nothing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, moved this check into gossip and added a test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I had to back out this change. Too many tests rely on the existing behavior that we re-gossip identical data. This just isn't worth figuring out for this PR.

ba.Timestamp = r.store.Clock().Now()
ba.Add(&roachpb.ScanRequest{Span: span})
br, trigger, pErr :=
r.executeBatch(r.ctx, storagebase.CmdIDKey(""), r.store.Engine(), nil, ba)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this use r.Send instead of going directly to executeBatch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern used in gossiping the system span and @tschottdorf encouraged that I continue following it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be consistent between the two, but I can't think of why it should be this way. Maybe add a TODO to change both of them to use Replica.Send (or document why we need this more complicated process).

Copy link
Member Author

@spencerkimball spencerkimball Sep 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out we need to do this to avoid reentry on the same key in the command queue, which blocks this from completion. Added comments in both locations.

br, trigger, pErr :=
r.executeBatch(r.ctx, storagebase.CmdIDKey(""), r.store.Engine(), nil, ba)
if pErr != nil {
log.Errorf(r.ctx, "couldn't scan node liveness records in span %s: %s", span, pErr.GoError())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should probably return instead of logging and continuing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done.

// UserDataSpan is the non-meta and non-structured portion of the key space.
UserDataSpan = roachpb.Span{Key: SystemMax, EndKey: TableDataMin}

// SystemDataSpans are spans which contain system data which needs to be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/SystemDataSpans/GossipedSystemSpans/g

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

timestamp greater than the latest expiration timestamp it has written
to the node lease table.
timestamp greater than the latest heartbeat timestamp plus a liveness
threshold duration. Note that the heartbeat is specified according to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are trying to express something like spanner's "disjointed invariant" here. It will be better to say that the lease expires after the heartbeat-timestamp + liveness-threshold-duration . Not clear what this liveness-threshold-duration is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, reworded slightly. The liveness threshold duration is computed based on the raft election timeout and a constant multiple, defined in the code.

@@ -47,28 +47,20 @@ its replicas holding range leases at once?

We introduce a new node lease table at the beginning of the keyspace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you specify the columns in this table and describe them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch 2 times, most recently from 8f62737 to fafaa3c Compare September 26, 2016 22:27
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only reviewed the change to the RFC so far. I'll get to the code tomorrow.

to set the new leaseholder or else set the leaseholder to 0. This is
necessary in the case of rebalancing when the node that holds the
range lease is being removed. `AdminTransferLease` will be enhanced to
perform transfers correctly using node lease style range leases.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/node lease style/epoch-based/g

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

themselves be live according to the liveness table. Keep in mind that
a node considers itself live according to whether it has successfully
written a recent liveness record which proves its liveness measured
by current time vs the record's expiration module the maximum clock
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/module/modulo/g

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

details on why that's unnecessary.]

In addition to nodes updating their own liveness entry with ongoing
updates via conditional puts, non-leaseholder nodes may increment
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"updating...with ongoing updates" is one too many "updates".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

In addition to nodes updating their own liveness entry with ongoing
updates via conditional puts, non-leaseholder nodes may increment
the epoch of a node which has failed to update its heartbeat in time
to keep it younger than the threshold liveness duration.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"...keep it younger than the expiration time".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

With 1,000 nodes and a 9s liveness duration threshold, we expect every
node to do a conditional put to update the heartbeat timestamp every
7.2s. That would correspond to ~140 reqs/second, a not-unreasonable
load for this function.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere in the document we talk about a 10,000 range table requiring 1,388 Raft commits per second. It would be nice to reiterate how many Raft ops/sec the old (non-epoch-based range leases) would require for a 1,000 node cluster with 10,000 replicas per node (3.3 million ranges).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

We still require the traditional expiration-based range leases for any
ranges located at or before the liveness table's range. This might be
problematic in the case of meta2 address record ranges, which are
expected to proliferate in a large cluster. This lease traffic could
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to put a number to the number of ranges which couldn't use epoch-based range leases. How many meta2 ranges will there be for the 3.3 million range cluster example I mention? I'm not sure how big RangeDescriptors currently are (we should measure this), but let's assume 1KB per range of meta2 data. A single meta2 range could hold 64K which would translate into 50 meta2 ranges. That doesn't seem worth optimizing.

Update: just checked on a local cluster and the size of a meta2 key+value averaged 184 bytes. That would translate into 364K ranges per meta2 range and we'd need 10 meta2 ranges to support 3.3 million ranges. You should probably double check my math here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you end up with some historical versions and other dross, so I agree somewhere between 10 and 50 meta2 ranges. Doesn't seem worth the effort. I added a note.

## Use of distributed txn for updating liveness records

The original proposal mentioned: "The range lease record is always
updated in a distributed transaction with the -node lease record to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/-node lease/node-lease/g

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to node liveness table


The original proposal mentioned: "The range lease record is always
updated in a distributed transaction with the -node lease record to
ensure that the epoch counter is consistent and -the start time is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/-the/the/g

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

TODO(peter): What is the motivation for gossipping the node lease
table? Gossipping means the node's will have out of date info for the
TODO(peter): What is the motivation for gossiping the node lease
table? Gossiping means the node's will have out of date info for the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind addressing this TODO? @bdarnell gave justification elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// expiration-based range leases instead of the more efficient
// node-liveness epoch-based range leases (see
// https://github.com/cockroachdb/cockroach/blob/develop/docs/RFCS/range_leases.md)
NodeLivenessPrefix = roachpb.Key(makeKey(SystemPrefix, roachpb.RKey("\x00liveness-")))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we almost never see raw keys anymore, is there a reason for the key to be so long? Probably doesn't matter (as there is only one liveness key per node), but it grates me a little.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only as long as the others.

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from fafaa3c to 97ad64b Compare September 27, 2016 13:14
record to increase the expiration timestamp and ensure that the epoch
has not changed. If the epoch does change, *all* of the range leases
held by this node are revoked. A node *must not* propose any commands
with a timestamp greater than its expiration timestamp modulo the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/modulo/minus/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Each node periodically performs a conditional put to its node liveness
record to increase the expiration timestamp and ensure that the epoch
has not changed. If the epoch does change, *all* of the range leases
held by this node are revoked. A node *must not* propose any commands
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add "or serve any reads" at the end of this line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -152,6 +152,7 @@ message ChangeReplicasTrigger {
// This can be used to trigger scan-and-gossip for the given span.
message ModifiedSpanTrigger {
optional bool system_config_span = 1 [(gogoproto.nullable) = false];
optional Span node_liveness_span = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, add a comment here about the difference.

// than a threshold liveness duration.
int64 epoch = 2;
// The timestamp at which this liveness record expires.
util.hlc.Timestamp expiration = 3 [(gogoproto.nullable) = false];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Store both the expiration and the stasis time (or the max offset), to make it possible to change the max offset safely.

Copy link
Member Author

@spencerkimball spencerkimball Sep 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this same pattern in the range lease proto. It seems like a bandaid on a gunshot wound. The reality is you simply will not be able to jigger max offsets without a stop-the-world cluster restart and a delay of at least the maximum lease expiration where max offset might matter. Just because the writer of this record and the reader(s) agree on the stasis timestamp doesn't protect from the inconsistencies which could arise if max offset were changed. To see why, consider the following scenario:

  • Node A has a max offset = 10ns, accurate clock
  • Node A sets liveness record expiration = @100ns, stasis = @90ns
  • Node B, with max offset = 20ns and a fast clock by 20ns, bumps Node A's epoch @101ns (which is @81ns, according to Node A's clock)
  • Node A & B can both believe they have leases

I think this "stasis" timestamp concept should be abolished in light of it being confusing, cognitively complex, and ultimately nothing but a Potemkin protection. We should be writing code with the assumption that all nodes agree on max offset. I don't believe we can do otherwise without kidding ourselves about guarantees.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should avoid relying on the stasis timestamp as much as we can. On current master, it's necessary to use it and it was deemed more explicit to have it in the protos than pulling it from configuration, where it's more obvious that it does matter. I agree that shortening the MaxOffset is a can of worms we don't want to open now. Instead I think we should improve the detection of an obvious misconfiguration by storing the stasis timestamp and letting nodes verify that it agrees with what they believe the MaxOffset to be.

I'm just starting to look at this change, so excuse any lack of global context,.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, my analysis is wrong as long as both Node A and Node B use their respective clock offsets in relation to the stasis time (i.e. Node B should consider expiration to be @110NS). But that means we only need one timestamp. Node A stops serving reads at Expiration; Node A only considers a lease or liveness record expired at Expiration + MaxOffset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, actually I think we do need to send the MaxOffset because the case I just mentioned only holds if Node B's MaxOffset is greater than or equal to Node A's. What we need is Node A to stop serving reads at Expiration and Node B to consider a lease or liveness record expired only after Expiration + Max(MaxOffset[Node B], MaxOffset[Node A]).

But again, this just is begging the question of whether we ought to support changing max offset without requiring a cluster freeze. Is this kind of complexity healthy when you weigh costs and benefits?

// computed from the specified tick interval and the default number of
// election timeout ticks.
func RaftElectionTimeout(raftTickInterval time.Duration) time.Duration {
return time.Duration(defaultRaftElectionTimeoutTicks) * raftTickInterval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RaftElectionTimeoutTicks can be changed; we shouldn't hard-code the default here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from 97ad64b to e8dd5da Compare September 28, 2016 17:21
@tbg
Copy link
Member

tbg commented Sep 28, 2016

Reviewed 1 of 9 files at r5.
Review status: 1 of 36 files reviewed at latest revision, 44 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 107 at r3 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done

The last part of the sentence isn't easy to parse, and what is the "heartbeat"? It's not in the schema.

docs/RFCS/range_leases.md, line 66 at r5 (raw file):

record to increase the expiration timestamp and ensure that the epoch
has not changed. If the epoch does change, *all* of the range leases
held by this node are revoked. A node *must not* propose any commands

Isn't this all about reads? What does it matter if the node proposes anything without a lease? It should try to avoid it, but since there is still a lease persisted in the replica state, all of those proposals would simply error out. Or am I missing something here?


docs/RFCS/range_leases.md, line 73 at r5 (raw file):

range which contains it. Gossip is used in order to minimize fanout
and make distribution efficient. The best-effort nature of gossip is
acceptable here because timely delivery of node livesness updates are

/lives/live/


docs/RFCS/range_leases.md, line 94 at r5 (raw file):

Nodes which propose or transfer an epoch-based range lease must
themselves be live according to the liveness table. Keep in mind that

What exactly does this mean? I could be live at propose time but not live at the time the proposal actually goes through. Explain what exactly is meant and what is required for correctness.


docs/RFCS/range_leases.md, line 100 at r5 (raw file):

offset.

To propose an epoch-based range lease, the existing lease must either

This sentence is hard to read. Please make that a little more digestible as it's pretty central.


docs/RFCS/range_leases.md, line 204 at r5 (raw file):

succeed at the range lease, but then fail immediately on attempting to
use the range lease, as it could not possibly still have an HLC clock
time corresponding to the now-old epoch at which it acquired the lease

Missing period.


Comments from Reviewable

@spencerkimball
Copy link
Member Author

Review status: 1 of 36 files reviewed at latest revision, 44 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 107 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The last part of the sentence isn't easy to parse, and what is the "heartbeat"? It's not in the schema.

Removed this paragraph. All info here is contained elsewhere.

docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Isn't this all about reads? What does it matter if the node proposes anything without a lease? It should try to avoid it, but since there is still a lease persisted in the replica state, all of those proposals would simply error out. Or am I missing something here?

s/propose/accept/

docs/RFCS/range_leases.md, line 73 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

/lives/live/

Done.

docs/RFCS/range_leases.md, line 94 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What exactly does this mean? I could be live at propose time but not live at the time the proposal actually goes through. Explain what exactly is meant and what is required for correctness.

Good catch. This was old verbiage and not required. Obviously a node will want to consider itself live when doing anything, but in the event its liveness update is delayed for any reason, not to worry. It's not relevant. The only way for an epoch-based range lease to be acquired is if the leader is set to 0, leader == acquirer & lease epoch == liveness epoch, or lease epoch is being incremented such that lease epoch == liveness epoch. I removed this paragraph.

docs/RFCS/range_leases.md, line 100 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This sentence is hard to read. Please make that a little more digestible as it's pretty central.

Improved.

docs/RFCS/range_leases.md, line 204 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Missing period.

Done.

storage/node_liveness.go, line 143 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Then should we have a separate field in NodeLiveness for our own liveness, instead of using this map for both values received from gossip and the special one we generate for ourselves?

Hmm, good point. We probably don't want gossip overwriting the truth. Changed.

Comments from Reviewable

@tbg
Copy link
Member

tbg commented Sep 28, 2016

Review status: 1 of 36 files reviewed at latest revision, 43 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

s/propose/accept/

?

docs/RFCS/range_leases.md, line 73 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

Publish-before-push

docs/RFCS/range_leases.md, line 100 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Improved.

Mind pinging when this is pushed? We usually do push-before-publish so that when I see your comments, I can also see your changes.

Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from e8dd5da to bb05979 Compare September 28, 2016 21:23
@spencerkimball
Copy link
Member Author

Review status: 1 of 36 files reviewed at latest revision, 43 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

?

I meant that I changed the sentence to read "A node _must not_ accept any commands or serve any reads with a timestamp greater than its expiration minus the maximum clock offset."

docs/RFCS/range_leases.md, line 73 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Publish-before-push

Done.

docs/RFCS/range_leases.md, line 100 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Mind pinging when this is pushed? We usually do push-before-publish so that when I see your comments, I can also see your changes.

Done.

Comments from Reviewable

@tbg
Copy link
Member

tbg commented Sep 28, 2016

Reviewed 16 of 34 files at r1, 2 of 6 files at r2, 9 of 17 files at r3, 8 of 9 files at r5, 3 of 3 files at r6.
Review status: all files reviewed at latest revision, 58 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 61 at r5 (raw file):

NodeID|     node identifier
Epoch|      monotonically increasing liveness epoch
Expiration| timestamp at which the liveness record expires

Heartbeats are not mentioned here nor explained anywhere else though they are referenced and implemented. Please clarify throughout.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I meant that I changed the sentence to read "A node must not accept any commands or serve any reads with a timestamp greater than its expiration minus the maximum clock offset."

Does that make sense? I assume that by "accept" you mean "apply" commands. But what is the effective "expiration timestamp"? That's not part of the state machine, so I can't use it for decision making. I would think that commands not proposed by the lease holder (which is a part of the replicated state, but doesn't have an expiration timestamp any more for your new type of lease) apply as a no-op (i.e. it error out). Is that what you mean? It would be good to "prove" that that covers all bases.

docs/RFCS/range_leases.md, line 97 at r6 (raw file):

is the leaseholder or the lease is expired.

An existing lease which uses the epoch-based mechanism may be acquired

When would that happen? Incrementing the epoch is covered in the third case, so here this would be the lease holder proposing an identical lease? Why?


keys/constants.go, line 201 at r5 (raw file):

  // NodeLivenessKeyMax is the maximum value for any node liveness key.
  NodeLivenessKeyMax = roachpb.Key(roachpb.RKey(NodeLivenessPrefix).PrefixEnd())

PrefixEnd() exists on Key as well.


kv/dist_sender.go, line 347 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

This change allows the system to properly retry such errors instead of failing the request to the distributed sender.

I believe this makes `noNodeAddrsAvailError` unused.

roachpb/data.proto, line 155 at r5 (raw file):

message ModifiedSpanTrigger {
  optional bool system_config_span = 1 [(gogoproto.nullable) = false];
  // The node liveness span can be used to indicate that node liveness

s/The node liveness span can be used/node_liveness_span is set/


roachpb/data.proto, line 156 at r5 (raw file):

  optional bool system_config_span = 1 [(gogoproto.nullable) = false];
  // The node liveness span can be used to indicate that node liveness
  // records need re-gossiping after modification or range leadership

avoid the word "leadership".


roachpb/data.proto, line 157 at r5 (raw file):

  // The node liveness span can be used to indicate that node liveness
  // records need re-gossiping after modification or range leadership
  // change. In the case of modification, the span is typically set to

s/In the case of modification,/When set,/


roachpb/data.proto, line 158 at r5 (raw file):

  // records need re-gossiping after modification or range leadership
  // change. In the case of modification, the span is typically set to
  // a single key. For range leadership change, the span is set to the

What does it mean when it's set to a single key? When does that happen and when the other case? Also same comment about word "leadership".


storage/client_test.go, line 490 at r5 (raw file):

      // querying all stores; it may not be present on all stores, but the
      // current version is guaranteed to be present on one of them.
      if str == nil {

What prompted this?


storage/liveness.proto, line 29 at r5 (raw file):

      (gogoproto.casttype) = "github.com/cockroachdb/cockroach/roachpb.NodeID"];
  // Epoch is a monotonically-increasing value for node liveness. It
  // may be incremented every time a node's last heartbeat is older

What is the heartbeat?


storage/node_liveness.go, line 65 at r2 (raw file):

      expiration = lr.received.Add(LivenessThreshold.Nanoseconds(), 0)
  }
  return nl.clock.Now().Less(expiration)

Looks like you could add LivenessThreshold here.


storage/node_liveness.go, line 117 at r2 (raw file):

// last heartbeat in the canonical node liveness table.
func (nl *NodeLiveness) StartHeartbeat(ctx context.Context, stopper *stop.Stopper) {
  log.VEventf(1, ctx, "starting liveness heartbeat for node %d", nl.gossip.GetNodeID())

NodeID should already be in the context.


storage/node_liveness.go, line 123 at r2 (raw file):

      for {
          if err := nl.heartbeat(ctx); err != nil {
              log.Errorf(ctx, "failed liveness heartbeat for node %d: %s", nl.gossip.GetNodeID(), err)

Ditto.


storage/node_liveness.go, line 212 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I'm adding a TODO for this – I think it's better measured in the context of more metrics related to the higher level concept of range leases, which I'll add with next PR.

Why kick that can down the road? We certainly do want to track this and it's worth seeing how this behaves without waiting for the more complicated PRs that try to do much more at once.

storage/node_liveness.go, line 226 at r5 (raw file):

// for the node specified by nodeID. In the event that the conditional
// put fails, the handleCondFailed callback is invoked with the actual
// node liveness record. The conditional put is done as a 1PC

... and nil returned (which is odd interface. Consider returning a special error instead).


storage/node_liveness.go, line 241 at r5 (raw file):

      if oldLiveness == nil {
          b.CPut(key, newLiveness, nil)
      } else {

The two branches are equivalent.


storage/node_liveness_test.go, line 121 at r5 (raw file):

  }

  // Verify that the epoch has been advanced

nit: dot.


storage/node_liveness_test.go, line 134 at r5 (raw file):

      }
      if live, err := mtc.nodeLivenesses[0].IsLive(deadNodeID); live || err != nil {
          return errors.Errorf("expected dead node to remain dead after epoch increment: %s", err)

print live as well.


storage/node_liveness_test.go, line 165 at r5 (raw file):

  // Register a callback to gossip in order to verify liveness records
  // are re-gossiped.
  var mu syncutil.Mutex

nit:

var keysMu struct {
   sync.Mutex
   keys []string
}

storage/node_liveness_test.go, line 167 at r5 (raw file):

  var mu syncutil.Mutex
  var keys []string
  livenessRegex := gossip.MakePrefixPattern(gossip.KeyNodeLivenessPrefix)

Weird that KeyNodeLivenessPrefix is in gossip.


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from bb05979 to 1ee4a91 Compare September 28, 2016 22:36
@spencerkimball
Copy link
Member Author

Review status: all files reviewed at latest revision, 58 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 61 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Heartbeats are not mentioned here nor explained anywhere else though they are referenced and implemented. Please clarify throughout.

I now mention heartbeat at the start.

docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Does that make sense? I assume that by "accept" you mean "apply" commands. But what is the effective "expiration timestamp"? That's not part of the state machine, so I can't use it for decision making. I would think that commands not proposed by the lease holder (which is a part of the replicated state, but doesn't have an expiration timestamp any more for your new type of lease) apply as a no-op (i.e. it error out). Is that what you mean? It would be good to "prove" that that covers all bases.

Clarified further.

docs/RFCS/range_leases.md, line 97 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

When would that happen? Incrementing the epoch is covered in the third case, so here this would be the lease holder proposing an identical lease? Why?

I removed that case as it doesn't make sense.

keys/constants.go, line 201 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

PrefixEnd() exists on Key as well.

Done.

kv/dist_sender.go, line 347 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I believe this makes noNodeAddrsAvailError unused.

Yes, and it's been removed from the codebase entirely. Maybe I'm missing what you're comment was meant to say?

roachpb/data.proto, line 155 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/The node liveness span can be used/node_liveness_span is set/

Done.

roachpb/data.proto, line 156 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

avoid the word "leadership".

Done.

roachpb/data.proto, line 157 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/In the case of modification,/When set,/

Done.

roachpb/data.proto, line 158 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What does it mean when it's set to a single key? When does that happen and when the other case? Also same comment about word "leadership".

Removed leadership, clarified comment.

storage/client_test.go, line 490 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What prompted this?

Node liveness heartbeats go out pretty quickly, sometimes before we've gossiped the first range, which causes the multi test context to hit the case of stores which haven't yet been initialized.

storage/liveness.proto, line 29 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What is the heartbeat?

updated to "expiration"

storage/node_liveness.go, line 65 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Looks like you could add LivenessThreshold here.

Not sure what you mean?

storage/node_liveness.go, line 117 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

NodeID should already be in the context.

Done.

storage/node_liveness.go, line 123 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto.

Done.

storage/node_liveness.go, line 212 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why kick that can down the road? We certainly do want to track this and it's worth seeing how this behaves without waiting for the more complicated PRs that try to do much more at once.

This isn't being used yet. It's just part of the API for the next change. I'll add metrics in that change which will make sense in the context of range leases.

storage/node_liveness.go, line 226 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

... and nil returned (which is odd interface. Consider returning a special error instead).

The point of the callback is to avoid getting any error. If I've specified the callback to handle a `ConditionalFailedError`, I don't want to then do a lot of the same work to verify the error type for the return.

storage/node_liveness.go, line 241 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The two branches are equivalent.

They're not. It's the difference between `interface{}(nil)` and `*NodeLiveness(nil)` and it matters to the `Batch` API.

storage/node_liveness_test.go, line 121 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: dot.

Done.

storage/node_liveness_test.go, line 134 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

print live as well.

Done.

storage/node_liveness_test.go, line 165 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit:

var keysMu struct {
   sync.Mutex
   keys []string
}
Done.

storage/node_liveness_test.go, line 167 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Weird that KeyNodeLivenessPrefix is in gossip.

...as are our other gossip keys. Doesn't seem weird.

Comments from Reviewable

@bdarnell
Copy link
Contributor

:lgtm:


Reviewed 16 of 34 files at r1, 2 of 6 files at r2, 4 of 17 files at r3, 6 of 9 files at r5, 8 of 8 files at r7.
Review status: all files reviewed at latest revision, 37 unresolved discussions, all commit checks successful.


server/server.go, line 210 at r7 (raw file):

  // liveness expiration and heartbeat interval.
  active, renewal := storage.RangeLeaseDurations(
      storage.RaftElectionTimeout(base.DefaultRaftTickInterval, 0 /* use default */))

This needs to use srvCtx.RaftTickInterval, not base.DefaultRaftTickInterval. And hard-coding zero here seems like a trap if we ever allow plumbing the election timeout through server.Context, so I'd rather go ahead and do that now so we can use it here. (or at least export the default constant and use it in place of 0 so we have something to grep for).


storage/node_liveness.go, line 75 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I see this type more as the API into node liveness in the cluster. The node both publishes its liveness through it via the periodic heartbeat, queries other nodes' liveness, or increments other nodes' epochs to prove they were non-live. The tasks are all so closely related I don't see benefit in separating them.

And I don't see the benefit in keeping them together. The difference with the `self` liveness record and the entry for our node ID in gossip is an example of the traps that are possible from combining the two sides of this operation. It looks to me like there's very little cost in making this two separate types (in the same file).

storage/node_liveness.go, line 212 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

This isn't being used yet. It's just part of the API for the next change. I'll add metrics in that change which will make sense in the context of range leases.

I think we want to deploy this change on its own first to make sure the liveness table is working before we start relying on it for leases. Adding metrics here will be useful for that stage of deployment.

storage/node_liveness.go, line 226 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

The point of the callback is to avoid getting any error. If I've specified the callback to handle a ConditionalFailedError, I don't want to then do a lot of the same work to verify the error type for the return.

I understand that you don't want to both run the callback and return an error, but this is still an odd interface. A callback that is only called just before return seems like it would be more naturally handled via a special return value (although it does get ugly - I wouldn't want to return a special error type here for the client to handle, but returning a `Liveness` that is only present on conditional put failure would be strange). I can't come up with a design that I clearly like better than this one so it's not worth obsessing over.

storage/node_liveness.go, line 241 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

They're not. It's the difference between interface{}(nil) and *NodeLiveness(nil) and it matters to the Batch API.

This could use a comment then.

storage/replica.go, line 3404 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Turns out we need to do this to avoid reentry on the same key in the command queue, which blocks this from completion. Added comments in both locations.

Hmm, seems strange. But not worth worrying about for this PR.

Comments from Reviewable

@tbg
Copy link
Member

tbg commented Sep 29, 2016

Reviewed 8 of 8 files at r7.
Review status: all files reviewed at latest revision, 41 unresolved discussions, all commit checks successful.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Clarified further.

Define what it means to be the "range lease holder" before this comment (in particular, is it a property tied o a HLC timestamp of a command, or a real-time measure, or a mixture of both?). Also, explain what happens at _application_ time of proposed commands. Surely they will error out when the node in the lease isn't equal to the proposing node, but what about the epochs when the node _does_ match? Is it at all a correctness issue to propose at any time not "covered" by the lease (whatever "covered" means since now there is a "real-time duration notion" and a command's HLC timestamp)? I also still would like to see some examples here. Seeing that we're about to update the design doc, the work will well be worth it.

docs/RFCS/range_leases.md, line 68 at r7 (raw file):

does change, *all* of the range leases held by this node are
revoked. A node can only execute commands (propose writes to Raft or
serve reads) if it's the range leaseholder, the range lease epoch is

s/leaseholder/lease holder/ (I think).


kv/dist_sender.go, line 347 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Yes, and it's been removed from the codebase entirely. Maybe I'm missing what you're comment was meant to say?

Sorry, I didn't notice that it had been removed, I thought there was cruft left you could tear out.

roachpb/data.proto, line 157 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

s/node's/nodes/

roachpb/data.proto, line 158 at r5 (raw file):

For range lease updates, the span is set to the entire node liveness key range.

"range lease updates" is not very specific. Maybe

On lease activity on the range containing the liveness span, all of the liveness information is (re-)gossiped by means of specifying the entire node liveness key range.

though that's also a bit unwieldy.


storage/client_test.go, line 490 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Node liveness heartbeats go out pretty quickly, sometimes before we've gossiped the first range, which causes the multi test context to hit the case of stores which haven't yet been initialized.

Add a TODO to avoid the next person wasting their life away trying to figure out what you just said.

storage/liveness.proto, line 33 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Ah, actually I think we do need to send the MaxOffset because the case I just mentioned only holds if Node B's MaxOffset is greater than or equal to Node A's. What we need is Node A to stop serving reads at Expiration and Node B to consider a lease or liveness record expired only after Expiration + Max(MaxOffset[Node B], MaxOffset[Node A]).

But again, this just is begging the question of whether we ought to support changing max offset without requiring a cluster freeze. Is this kind of complexity healthy when you weigh costs and benefits?

I don't think changing MaxOffset should be a goal for anytime in the near future.

storage/liveness.proto, line 29 at r7 (raw file):

      (gogoproto.casttype) = "github.com/cockroachdb/cockroach/roachpb.NodeID"];
  // Epoch is a monotonically-increasing value for node liveness. It
  // may be incremented every time a node's expiration is older than a

s/every time/when/

What's a threshold liveness duration? I would think that the conditions for updating is that the HLC of the incrementor is ahead of the old expiration timestamp (at the time the action is taken), and (correspondingly, of course) that the new expiration timestamp is larger than the old one.


storage/node_liveness.go, line 65 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Not sure what you mean?

The comment referred to code that has since been removed.

storage/node_liveness.go, line 226 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I understand that you don't want to both run the callback and return an error, but this is still an odd interface. A callback that is only called just before return seems like it would be more naturally handled via a special return value (although it does get ugly - I wouldn't want to return a special error type here for the client to handle, but returning a Liveness that is only present on conditional put failure would be strange). I can't come up with a design that I clearly like better than this one so it's not worth obsessing over.

If this interface stays, comment about the nil return please.

storage/node_liveness_test.go, line 167 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

...as are our other gossip keys. Doesn't seem weird.

Still strikes me as odd, but not worth changing here.

storage/replica.go, line 3404 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Hmm, seems strange. But not worth worrying about for this PR.

Why strange? We call in here synchronously from a commit trigger.

Comments from Reviewable

@bdarnell
Copy link
Contributor

Review status: all files reviewed at latest revision, 41 unresolved discussions, all commit checks successful.


storage/replica.go, line 3404 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why strange? We call in here synchronously from a commit trigger.

From a commit trigger we should be using low-level (MVCC) APIs, or kicking off a goroutine to use the high-level (Send) API. Starting in the middle (executeBatch) is weird and seems dangerous - it doesn't matter for the way these tables are being used now, but if we're not blocking in the command queue then we don't interact with other transactions in the same way as normal reads.

Comments from Reviewable

@tbg
Copy link
Member

tbg commented Sep 29, 2016

Review status: all files reviewed at latest revision, 41 unresolved discussions, all commit checks successful.


storage/replica.go, line 3404 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

From a commit trigger we should be using low-level (MVCC) APIs, or kicking off a goroutine to use the high-level (Send) API. Starting in the middle (executeBatch) is weird and seems dangerous - it doesn't matter for the way these tables are being used now, but if we're not blocking in the command queue then we don't interact with other transactions in the same way as normal reads.

I agree. I meant that it's not strange that this deadlocks with `r.Send` as-is.

Comments from Reviewable

circular dependencies. This table maps node IDs to an epoch counter,
and an expiration timestamp.

## Liveness table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node liveness table

liveness updates will simply resort to a conditional put to increment
a seemingly not-live node's liveness epoch. The conditional put will
fail because the expected value is out of date and the correct liveness
info is returned to the caller.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is better placed further down. Perhaps after the next two paragraphs

epoch. If a node is down (and its node lease has expired), another
node may revoke its lease(s) by incrementing the node lease
epoch. If a node is down (and its node liveness has expired), another
node may revoke its lease(s) by incrementing the node liveness
epoch. Once this is done the old range lease is invalidated and a new
node may claim the range lease.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's nice to be explicit here about the disjointed invariant. A range lease can move from node1 to node2 only after the node1's liveness record has expired, and node2 has a valid unexpired liveness epoch.

epoch. If a node is down (and its node lease has expired), another
node may revoke its lease(s) by incrementing the node lease
epoch. If a node is down (and its node liveness has expired), another
node may revoke its lease(s) by incrementing the node liveness
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which node liveness?

@spencerkimball
Copy link
Member Author

Review status: all files reviewed at latest revision, 45 unresolved discussions, all commit checks successful.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Define what it means to be the "range lease holder" before this comment (in particular, is it a property tied o a HLC timestamp of a command, or a real-time measure, or a mixture of both?). Also, explain what happens at application time of proposed commands. Surely they will error out when the node in the lease isn't equal to the proposing node, but what about the epochs when the node does match? Is it at all a correctness issue to propose at any time not "covered" by the lease (whatever "covered" means since now there is a "real-time duration notion" and a command's HLC timestamp)?
I also still would like to see some examples here. Seeing that we're about to update the design doc, the work will well be worth it.

I added another couple of paragraphs. Not sure what you're looking for with examples. Perhaps you could suggest some?

docs/RFCS/range_leases.md, line 56 at r7 (raw file):

Previously, vivekmenezes wrote…

Node liveness table

Done.

docs/RFCS/range_leases.md, line 68 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/leaseholder/lease holder/ (I think).

leaseholder is a word.

docs/RFCS/range_leases.md, line 80 at r7 (raw file):

Previously, vivekmenezes wrote…

This paragraph is better placed further down. Perhaps after the next two paragraphs

Done.

docs/RFCS/range_leases.md, line 84 at r7 (raw file):

Previously, vivekmenezes wrote…

which node liveness?

Done.

docs/RFCS/range_leases.md, line 86 at r7 (raw file):

Previously, vivekmenezes wrote…

I think it's nice to be explicit here about the disjointed invariant. A range lease can move from node1 to node2 only after the node1's liveness record has expired, and node2 has a valid unexpired liveness epoch.

Done.

kv/dist_sender.go, line 347 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Sorry, I didn't notice that it had been removed, I thought there was cruft left you could tear out.

Done.

roachpb/data.proto, line 157 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/node's/nodes/

Done.

roachpb/data.proto, line 158 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

For range lease updates, the span is set to the entire node liveness key range.

"range lease updates" is not very specific. Maybe

On lease activity on the range containing the liveness span, all of the liveness information is (re-)gossiped by means of specifying the entire node liveness key range.

though that's also a bit unwieldy.

Done.

server/server.go, line 210 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This needs to use srvCtx.RaftTickInterval, not base.DefaultRaftTickInterval. And hard-coding zero here seems like a trap if we ever allow plumbing the election timeout through server.Context, so I'd rather go ahead and do that now so we can use it here. (or at least export the default constant and use it in place of 0 so we have something to grep for).

OK, it's in `server.Context` now instead of using 0, though it's currently not settable via command line in any way.

storage/client_test.go, line 490 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Add a TODO to avoid the next person wasting their life away trying to figure out what you just said.

Done.

storage/liveness.proto, line 33 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I don't think changing MaxOffset should be a goal for anytime in the near future.

Right. Sent PR #9612

storage/liveness.proto, line 29 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/every time/when/

What's a threshold liveness duration? I would think that the conditions for updating is that the HLC of the incrementor is ahead of the old expiration timestamp (at the time the action is taken), and (correspondingly, of course) that the new expiration timestamp is larger than the old one.

Done.

storage/node_liveness.go, line 65 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The comment referred to code that has since been removed.

Done.

storage/node_liveness.go, line 75 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

And I don't see the benefit in keeping them together. The difference with the self liveness record and the entry for our node ID in gossip is an example of the traps that are possible from combining the two sides of this operation. It looks to me like there's very little cost in making this two separate types (in the same file).

I dislike the idea of making two objects related to the contents of the node liveness table. Two objects for server, two objects to pass into stores, two objects for all of the unittests.

storage/node_liveness.go, line 212 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we want to deploy this change on its own first to make sure the liveness table is working before we start relying on it for leases. Adding metrics here will be useful for that stage of deployment.

OK, I added metrics.

storage/node_liveness.go, line 226 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

If this interface stays, comment about the nil return please.

Done.

storage/node_liveness.go, line 241 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This could use a comment then.

Done.

Comments from Reviewable

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch 3 times, most recently from 6eb0dc1 to 4b42c26 Compare September 30, 2016 14:09
@tbg
Copy link
Member

tbg commented Oct 3, 2016

Reviewed 13 of 13 files at r8.
Review status: all files reviewed at latest revision, 44 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I added another couple of paragraphs. Not sure what you're looking for with examples. Perhaps you could suggest some?

My comment on your new paragraphs has one such example.

docs/RFCS/range_leases.md, line 68 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

leaseholder is a word.

That was also what I was thinking in the PR that introduced the rename, but I was told to spell it `LeaseHolder` instead of `Leaseholder`, indicating that it is in fact two words (at least in this codebase, by precedent).

docs/RFCS/range_leases.md, line 67 at r8 (raw file):

specify an epoch in addition to the owner, instead of using a
timestamp expiration. The lease is valid for as long as the epoch for
the lease holder is valid according to the node liveness table.  To

Double space


docs/RFCS/range_leases.md, line 73 at r8 (raw file):

least the maximum clock offset further in the future than the command
timestamp. If any of these conditions are not true, commands are
rejected before being execute (in the case of read-only commands) or

executed


docs/RFCS/range_leases.md, line 82 at r8 (raw file):

available, or may have changed at command-apply time. It should be
noted that the verification for expiration-based leases is a sanity
check, not a correctness requirement. The replica holding the lease

That's not correct. Checking upstream of Raft is a sanity check. It's necessary downstream.

  • Replica 1 receives a client request for a write
  • Replica 1 checks the lease; the write is permitted
  • Replica 1 proposes the command
  • time passes, Replica 2 commits a new lease
  • the command applies on Replica 1
  • Replica 2 serves anomalous reads which don't see the write
  • the command applies on Replica 2

I'm fairly convinced that anything without some downstream-of-Raft check is incorrect seeing that proposals can arbitrarily delay and reorder.


roachpb/data.proto, line 158 at r8 (raw file):

  // need re-gossiping after modification or range lease updates. The
  // span is set to a single key when nodes update their liveness records
  // with heartbeats to extend the expiration timestamp. Changes to then

s/then/the/


storage/node_liveness_test.go, line 147 at r8 (raw file):

  })

  // Verify epoch increment metric count.

Mind briefly stress-testing your two tests?


storage/store.go, line 103 at r8 (raw file):

// defaultRaftElectionTimeoutTicks.
func RaftElectionTimeout(raftTickInterval time.Duration, raftElectionTimeoutTicks int) time.Duration {
  if raftTickInterval == 0 {

Who would pass 0 here?


Comments from Reviewable

@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Oct 4, 2016
@tbg
Copy link
Member

tbg commented Oct 4, 2016

Assigned stability label to mark for supervised merge.

@spencerkimball
Copy link
Member Author

Review status: all files reviewed at latest revision, 44 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

My comment on your new paragraphs has one such example.

I added your example.

docs/RFCS/range_leases.md, line 68 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That was also what I was thinking in the PR that introduced the rename, but I was told to spell it LeaseHolder instead of Leaseholder, indicating that it is in fact two words (at least in this codebase, by precedent).

Done.

docs/RFCS/range_leases.md, line 67 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Double space

Done.

docs/RFCS/range_leases.md, line 73 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

executed

Done.

docs/RFCS/range_leases.md, line 82 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That's not correct. Checking upstream of Raft is a sanity check. It's necessary downstream.

  • Replica 1 receives a client request for a write
  • Replica 1 checks the lease; the write is permitted
  • Replica 1 proposes the command
  • time passes, Replica 2 commits a new lease
  • the command applies on Replica 1
  • Replica 2 serves anomalous reads which don't see the write
  • the command applies on Replica 2

I'm fairly convinced that anything without some downstream-of-Raft check is incorrect seeing that proposals can arbitrarily delay and reorder.

You're correct. I realized this problem during range lease implementation and have updated the design here.

roachpb/data.proto, line 158 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/then/the/

Done.

storage/node_liveness_test.go, line 147 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Mind briefly stress-testing your two tests?

Done.

storage/store.go, line 103 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Who would pass 0 here?

The server does when it creates the node liveness (the normal case). It pulls the raft tick interval from the server context, where it is just set to `0`.

Comments from Reviewable

@spencerkimball
Copy link
Member Author

@tschottdorf sorry for responding with no push. Damn plane wifi won't let me. Will push in CA.

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from 4b42c26 to ce9806a Compare October 6, 2016 16:35
@spencerkimball
Copy link
Member Author

@tschottdorf ptal

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch 2 times, most recently from 6837ae6 to 1f36e2f Compare October 6, 2016 20:37
@spencerkimball
Copy link
Member Author

@tschottdorf ping

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from 1f36e2f to 64b45f4 Compare October 11, 2016 05:42
@bdarnell
Copy link
Contributor

Reviewed 2 of 13 files at r8, 28 of 28 files at r9.
Review status: all files reviewed at latest revision, 26 unresolved discussions, all commit checks successful.


Comments from Reviewable

@spencerkimball
Copy link
Member Author

@cockroachdb/stability, @petermattis I'd like to merge this PR. Speak now or forever hold your peace.

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from 64b45f4 to ab5882d Compare October 12, 2016 05:39
@tbg
Copy link
Member

tbg commented Oct 12, 2016

:lgtm: and sorry for letting this sit. Note that this is still a stability label PR so please don't merge it without confirming with @petermattis first.


Reviewed 17 of 28 files at r9, 11 of 11 files at r10.
Review status: all files reviewed at latest revision, 37 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 87 at r10 (raw file):

Raft commands as `OriginLease`. At command-apply time, each node
verifies that the lease in the FSM is equal to the lease verified
upstream of Raft by the proposer.

👍 It might be worth including only a subset of data that identifies the lease uniquely instead of the full proto but let's talk about it when that change actually happens.


server/server.go, line 289 at r10 (raw file):

      RangeLeaseActiveDuration:  active,
      RangeLeaseRenewalDuration: renewal,
      Locality:                  srvCtx.Locality,

Where's this coming from?


storage/node_liveness_test.go, line 216 at r8 (raw file):

  // Verify liveness of all nodes for all nodes.
  verifyLiveness(t, mtc)

Why'd this go? Flaky? Why?


storage/node_liveness_test.go, line 236 at r10 (raw file):

          t.Fatal(err)
      }
      if atomic.LoadInt32(&count) < 2 {

Can't the callback fire a couple of times, making this potentially flaky?


storage/store.go, line 561 at r10 (raw file):

  // Locality is a description of the topography of the store.
  Locality roachpb.Locality

Ditto.


Comments from Reviewable

@spencerkimball
Copy link
Member Author

Review status: all files reviewed at latest revision, 37 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 87 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

👍 It might be worth including only a subset of data that identifies the lease uniquely instead of the full proto but let's talk about it when that change actually happens.

Yeah, perhaps. We're going to have to see what kind of performance affect sending lease has. It's pretty small potatoes though when encoded.

server/server.go, line 289 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Where's this coming from?

From a rebase. Strange...removed.

storage/node_liveness_test.go, line 216 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why'd this go? Flaky? Why?

I don't recall why. Changing it back didn't affect determinism of test. I'm just going to leave like this. 6 of one, half dozen of the other.

storage/node_liveness_test.go, line 236 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can't the callback fire a couple of times, making this potentially flaky?

Well, that's why we're comparing it as `< 2` instead of `==`

storage/store.go, line 561 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto.

Done.

Comments from Reviewable

This change adds a node liveness table as a global system table.
Nodes periodically write updates to their liveness record by doing
a conditional put to the liveness table. The leader of the range
containing the node liveness table gossips the latest information
to the rest of the system.

Each node has a `NodeLiveness` object which can be used to query
the status of any other node to find out if it's live or non-live
according to the liveness threshold duration compared to the last
time it successfully heartbeat its liveness record.

The as-yet-unused `IncrementEpoch` mechanism is also added in this
PR, for eventual use with the planned epoch-based range leader leases.

Updated the range leader lease RFC to reflect current thinking.
@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from ab5882d to 9e38221 Compare October 12, 2016 09:32
@spencerkimball spencerkimball merged commit 73b6458 into master Oct 12, 2016
@spencerkimball spencerkimball deleted the spencerkimball/node-liveness branch October 12, 2016 09:45
@petermattis
Copy link
Collaborator

Eh? I guess that is one way to sequence risky PRs.

On Wednesday, October 12, 2016, Spencer Kimball notifications@github.com
wrote:

Merged #9530 #9530.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#9530 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AF6f94YpWHkhmo1yIyn9usFA46qWQkj4ks5qzKxSgaJpZM4KF_kV
.

@spencerkimball
Copy link
Member Author

This is not a risky PR. But what is the process?

On Wed, Oct 12, 2016 at 12:11 PM Peter Mattis notifications@github.com
wrote:

Eh? I guess that is one way to sequence risky PRs.

On Wednesday, October 12, 2016, Spencer Kimball notifications@github.com
wrote:

Merged #9530 #9530.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#9530 (comment),
or mute
the thread
<
https://github.com/notifications/unsubscribe-auth/AF6f94YpWHkhmo1yIyn9usFA46qWQkj4ks5qzKxSgaJpZM4KF_kV

.


You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub
#9530 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AF3MTTP7US_hzKVFjMP46GzmVxTX_nx3ks5qzLJtgaJpZM4KF_kV
.

@petermattis
Copy link
Collaborator

The process was to wait for a thumbs up, merge, deploy the immediately preceding SHA and then deploy this PR. We were holding off merging any of these "stability" changes until this week's beta was baked as it contains all of the develop->master merge changes. Even if this wasn't clear because you're remote this week, @tschottdorf's comment should have alerted you:

Note that this is still a stability label PR so please don't merge it without confirming with @petermattis first.

@spencerkimball
Copy link
Member Author

Happy to revert it.

On Wed, Oct 12, 2016 at 1:15 PM Peter Mattis notifications@github.com
wrote:

The process was to wait for a thumbs up, merge, deploy the immediately
preceding SHA and then deploy this PR. We were holding off merging any of
these "stability" changes until this week's beta was baked as it contains
all of the develop->master merge changes. Even if this wasn't clear because
you're remote this week, @tschottdorf https://github.com/tschottdorf's
comment should have alerted you:

Note that this is still a stability label PR so please don't merge it
without confirming with @petermattis https://github.com/petermattis
first.

You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
#9530 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AF3MTZNRfLIfokXE0-gyh2CrkDHmYqzjks5qzMFCgaJpZM4KF_kV
.

@bdarnell
Copy link
Contributor

I'd say leave it in. I don't think this is risky enough to need a supervised merge (the change to actually use node liveness for leases will be, but this part seems pretty safe).

@tamird
Copy link
Contributor

tamird commented Oct 12, 2016

Reviewed 6 of 34 files at r1, 1 of 17 files at r3, 1 of 8 files at r7, 2 of 13 files at r8, 16 of 28 files at r9, 9 of 11 files at r10, 2 of 2 files at r11.
Review status: all files reviewed at latest revision, 50 unresolved discussions, all commit checks successful.


docs/RFCS/range_leases.md, line 48 at r11 (raw file):

# Detailed design

We introduce a new node liveness table at the beginning of the keyspace

didn't we just update the design doc to avoid the word "table" in such contexts?


keys/constants.go, line 197 at r11 (raw file):

  // expiration-based range leases instead of the more efficient
  // node-liveness epoch-based range leases (see
  // https://github.com/cockroachdb/cockroach/blob/develop/docs/RFCS/range_leases.md)

this link is broken. there is no develop branch.


server/context.go, line 117 at r11 (raw file):

  RaftTickInterval time.Duration

  // RaftElectionTimeoutTicks is the number of raft ticks before the

It'd be helpful for this comment to explain how this field relates to storage.StoreConfig.RaftElectionTimeoutTicks


storage/node_liveness.go, line 151 at r11 (raw file):

}

// ManualHeartbeat triggers a heartbeat outside of the normal periodic

this can be in helpers_test.go, can't it?


storage/node_liveness.go, line 157 at r11 (raw file):

}

// StopHeartbeat ends the heartbeat loop. Used for unittesting.

ditto


storage/node_liveness.go, line 313 at r11 (raw file):

          var actualLiveness Liveness
          if err := cErr.ActualValue.GetProto(&actualLiveness); err != nil {
              return errors.Errorf("couldn't update node liveness from CPut actual value: %s", err)

errors.Wrapf


storage/node_liveness.go, line 328 at r11 (raw file):

  var liveness Liveness
  if err := content.GetProto(&liveness); err != nil {
      log.Error(context.Background(), err)

this is an inappropriate use of context.Background().


storage/node_liveness_test.go, line 43 at r11 (raw file):

          for _, g := range mtc.gossips {
              live, err := nl.IsLive(g.GetNodeID())
              if !live {

seems like the error should take precedence; won't this always be true while err is non-nil?


storage/node_liveness_test.go, line 62 at r11 (raw file):

func TestNodeLiveness(t *testing.T) {
  defer leaktest.AfterTest(t)()
  mtc := startMultiTestContext(t, 3)

shouldn't we be moving to testcluster?


storage/node_liveness_test.go, line 76 at r11 (raw file):

      nodeID := mtc.gossips[idx].GetNodeID()
      live, err := nl.IsLive(nodeID)
      if live {

ditto: shouldn't the error take precedence?


storage/node_liveness_test.go, line 114 at r11 (raw file):

  if err := mtc.nodeLivenesses[0].IncrementEpoch(
      context.Background(), deadNodeID); !testutils.IsError(err, "cannot increment epoch on live node") {
      t.Fatalf("expected error incrementing a live node")

include the value of err


storage/node_liveness_test.go, line 142 at r11 (raw file):

      }
      if live, err := mtc.nodeLivenesses[0].IsLive(deadNodeID); live || err != nil {
          return errors.Errorf("expected dead node to remain dead after epoch increment %t: %s", live, err)

%v because err might be nil.


storage/node_liveness_test.go, line 166 at r11 (raw file):

  // Clear the liveness records in store 1's gossip to make sure we're
  // seeing the liveness record properly gossiped at store startup.
  var expKeys []string

seems like using a map[string]struct{} here would have been simpler.


Comments from Reviewable

@petermattis
Copy link
Collaborator

Leave it in, though you did make a couple of tests flaky (i.e. TestConcurrentRaftSnapshots).

@spencerkimball
Copy link
Member Author

@tamird I made most of your suggested changes (sending another PR). ManualHeartbeat is still in because it's actually not just for unittesting afterall and the upcoming range lease change requires it.


Comments from Reviewable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants