stability: implement node liveness; first step towards new range leases #9530
@@ -353,7 +344,8 @@ func (ds *DistSender) sendRPC(
ba roachpb.BatchRequest,
) (*roachpb.BatchResponse, error) {
if len(replicas) == 0 {
return nil, noNodeAddrsAvailError{}
return nil, roachpb.NewSendError(
This change allows the system to properly retry such errors instead of failing the request to the distributed sender.
01f8b9e to 6467402
Generally looks like a good change to me, although I still don't think we should include this under the code yellow. To my mind it's out of scope for the stability crisis so we should either end the code yellow first or revise our policies to make them more sustainable for a code yellow that's going to go on for a longer time (i.e. in either case we should merge the master and develop branches before this change goes in).
node lease record to ensure that the epoch counter is consistent and
the start time is greater than the prior range lease holder’s node
lease expiration (plus the maximum clock offset).
and a node heartbeat timestamp.
Storing expiration timestamps rather than last-heartbeat timestamps is generally better because it makes it possible to change the heartbeat interval and lease timeout. (The expiration timestamp is a promise by that node not to serve any commands after expiration, so it relies only on that node's timing parameters. If each node independently computes the expiration based on the last heartbeat, then all nodes must agree about the lease timeout and it becomes very difficult to ever change).
I'm amending this to include a duration as well as the last heartbeat.
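To make the trade-off in this thread concrete, here is a minimal Go sketch of a liveness record that stores an expiration instead of a last-heartbeat time. The `Liveness` type, its methods, and the plain nanosecond timestamps (standing in for HLC timestamps) are illustrative assumptions, not the code in this PR, and clock-offset handling is left out.

```go
package main

import "fmt"

const second = int64(1e9) // nanoseconds per second

// Liveness is a hypothetical, simplified liveness record. Timestamps are
// plain nanosecond counts standing in for HLC timestamps.
type Liveness struct {
	Epoch      int64
	Expiration int64 // chosen by the node that owns the record
}

// Heartbeat extends the expiration using the writer's own timeout, so the
// timeout never has to be agreed on by readers.
func (l *Liveness) Heartbeat(now, timeout int64) {
	l.Expiration = now + timeout
}

// IsLive needs no timing parameters from the reader's configuration.
func (l *Liveness) IsLive(now int64) bool {
	return now < l.Expiration
}

func main() {
	var l Liveness
	l.Heartbeat(0, 9*second)
	fmt.Println(l.IsLive(5*second), l.IsLive(10*second)) // true false
}
```

Because only the writer applies the timeout, changing the heartbeat interval or the timeout on one node does not require every reader to change in lockstep.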
liveness table. They do not consult their outlook on the liveness
table and can even be disconnected from gossip.

[NB: previously this RFC recommended a distributed transaction to
The RFC template has an "alternatives" section for this kind of thing.
This doesn't rise to the level of an alternative to the proposed design. I just wanted to provide a complete description of why a distributed txn with the liveness record isn't necessary for posterity. I've moved the bulk of this comment down into a subsection of Alternatives and added a nota bene link.
an HLC clock time corresponding to the now-old epoch at which it
acquired the lease.]

If gossip cannot be reached by a node, then it will lose the ability
Why does gossip matter? Isn't it enough that the node was able to execute its conditional put to increase its timestamp?
It actually doesn't. As long as the node can heartbeat, it's set. I removed this paragraph.
range lease is being removed. `AdminTransferLease` will be enhanced to
perform transfers correctly using node lease style range leases.

Nodes which propose or transfer an epoch-based leader lease must
Throughout: we call it the "range lease", not the "leader lease" now, to avoid confusion with raft leadership.
Done. Changed in the code now too.
3| 3| 1| -
4| 2| 19| -

On write to range 4, leader lease is invalid. To get a new lease, a
What does this first sentence mean?
I removed this section now that I added the liveness table section at Vivek's request.
verifyLiveness(t, mtc)
mtc.stopStore(0)

// Create a new gossip instance and connect it.
Could you just make two stores in the multiTestContext instead of creating a second gossip instance by hand?
Rewritten to avoid use of a new gossip and also uses two nodes to verify we're sending more than just the first.
// must hold the lease to a range which contains some or all of the
// node liveness records. After scanning the records, it checks
// against what's already in gossip and only gossips records which
// are out of date.
It might be simpler to have gossip do this check itself.
Which check?
Gossip.AddInfo could check and see if the new value is already present, and if so do nothing.
OK, moved this check into gossip and added a test.
Actually I had to back out this change. Too many tests rely on the existing behavior that we re-gossip identical data. This just isn't worth figuring out for this PR.
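For reference, the idea floated in this thread (having the info store itself skip identical values) could look roughly like the Go sketch below. `infoStore` and `addInfo` are hypothetical stand-ins, not the real gossip package API, and the change was ultimately backed out of this PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// infoStore is a hypothetical stand-in for a gossip info store.
type infoStore struct {
	infos map[string][]byte
}

// addInfo stores value under key and reports whether anything changed.
// Identical values are skipped, so callers need not compare beforehand.
func (s *infoStore) addInfo(key string, value []byte) bool {
	if old, ok := s.infos[key]; ok && bytes.Equal(old, value) {
		return false
	}
	// Copy the value so later mutation by the caller cannot alias it.
	s.infos[key] = append([]byte(nil), value...)
	return true
}

func main() {
	s := &infoStore{infos: map[string][]byte{}}
	fmt.Println(s.addInfo("liveness:1", []byte("epoch=1"))) // true
	fmt.Println(s.addInfo("liveness:1", []byte("epoch=1"))) // false
	fmt.Println(s.addInfo("liveness:1", []byte("epoch=2"))) // true
}
```

As noted above, existing tests depend on identical data being re-gossiped, which is exactly the behavior a check like this would suppress.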
ba.Timestamp = r.store.Clock().Now()
ba.Add(&roachpb.ScanRequest{Span: span})
br, trigger, pErr :=
r.executeBatch(r.ctx, storagebase.CmdIDKey(""), r.store.Engine(), nil, ba)
Can this use r.Send instead of going directly to executeBatch?
This is the pattern used in gossiping the system span and @tschottdorf encouraged that I continue following it.
I agree this should be consistent between the two, but I can't think of why it should be this way. Maybe add a TODO to change both of them to use Replica.Send (or document why we need this more complicated process).
Turns out we need to do this to avoid reentry on the same key in the command queue, which blocks this from completion. Added comments in both locations.
br, trigger, pErr :=
r.executeBatch(r.ctx, storagebase.CmdIDKey(""), r.store.Engine(), nil, ba)
if pErr != nil {
log.Errorf(r.ctx, "couldn't scan node liveness records in span %s: %s", span, pErr.GoError())
These should probably return instead of logging and continuing.
Good point. Done.
// UserDataSpan is the non-meta and non-structured portion of the key space.
UserDataSpan = roachpb.Span{Key: SystemMax, EndKey: TableDataMin}

// SystemDataSpans are spans which contain system data which needs to be
s/SystemDataSpans/GossipedSystemSpans/g
done.
timestamp greater than the latest expiration timestamp it has written
to the node lease table.
timestamp greater than the latest heartbeat timestamp plus a liveness
threshold duration. Note that the heartbeat is specified according to
I think you are trying to express something like Spanner's "disjointness invariant" here. It would be better to say that the lease expires after heartbeat-timestamp + liveness-threshold-duration. It's also not clear what this liveness-threshold-duration is.
Ok, reworded slightly. The liveness threshold duration is computed based on the raft election timeout and a constant multiple, defined in the code.
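A rough Go sketch of the relationship described in this thread: the threshold is a constant multiple of the raft election timeout, and the lease expires at heartbeat-timestamp + liveness-threshold-duration. The multiple of 3 and the 3s election timeout are assumptions picked for illustration; the real constants live in the code and are not quoted here.

```go
package main

import (
	"fmt"
	"time"
)

// livenessMultiple is an assumed constant; the real multiple is defined in
// the code under review and is not stated in this thread.
const livenessMultiple = 3

// exampleElectionTimeout is an illustrative raft election timeout.
const exampleElectionTimeout = 3 * time.Second

// livenessThreshold derives the threshold from the raft election timeout.
func livenessThreshold(electionTimeout time.Duration) time.Duration {
	return livenessMultiple * electionTimeout
}

// leaseExpiration returns heartbeat-timestamp + liveness-threshold-duration.
func leaseExpiration(lastHeartbeat time.Time, threshold time.Duration) time.Time {
	return lastHeartbeat.Add(threshold)
}

func main() {
	threshold := livenessThreshold(exampleElectionTimeout)
	heartbeat := time.Unix(100, 0)
	fmt.Println(threshold)                                    // 9s
	fmt.Println(leaseExpiration(heartbeat, threshold).Unix()) // 109
}
```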
@@ -47,28 +47,20 @@ its replicas holding range leases at once?

We introduce a new node lease table at the beginning of the keyspace
Can you specify the columns in this table and describe them.
Done
8f62737 to fafaa3c
I've only reviewed the change to the RFC so far. I'll get to the code tomorrow.
to set the new leaseholder or else set the leaseholder to 0. This is
necessary in the case of rebalancing when the node that holds the
range lease is being removed. `AdminTransferLease` will be enhanced to
perform transfers correctly using node lease style range leases.
s/node lease style/epoch-based/g
Done
themselves be live according to the liveness table. Keep in mind that
a node considers itself live according to whether it has successfully
written a recent liveness record which proves its liveness measured
by current time vs the record's expiration module the maximum clock
s/module/modulo/g
Done
details on why that's unnecessary.]

In addition to nodes updating their own liveness entry with ongoing
updates via conditional puts, non-leaseholder nodes may increment
"updating...with ongoing updates" is one too many "updates".
Done
In addition to nodes updating their own liveness entry with ongoing
updates via conditional puts, non-leaseholder nodes may increment
the epoch of a node which has failed to update its heartbeat in time
to keep it younger than the threshold liveness duration.
"...keep it younger than the expiration time".
Done
With 1,000 nodes and a 9s liveness duration threshold, we expect every
node to do a conditional put to update the heartbeat timestamp every
7.2s. That would correspond to ~140 reqs/second, a not-unreasonable
load for this function.
Elsewhere in the document we talk about a 10,000 range table requiring 1,388 Raft commits per second. It would be nice to reiterate how many Raft ops/sec the old (non-epoch-based range leases) would require for a 1,000 node cluster with 10,000 replicas per node (3.3 million ranges).
Updated.
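The figures in this thread can be reproduced with a short calculation. The Go sketch below assumes each record (node liveness record or, for the old scheme, range lease) is renewed once every 7.2s, which is the interval the quoted RFC text uses; `renewalsPerSecond` is a hypothetical helper name.

```go
package main

import "fmt"

// renewalsPerSecond returns the steady-state write rate when each of count
// records is renewed once per intervalSeconds.
func renewalsPerSecond(count, intervalSeconds float64) float64 {
	return count / intervalSeconds
}

func main() {
	const interval = 7.2 // seconds between renewals, from the RFC text

	// Epoch-based: one liveness record per node.
	fmt.Printf("1,000 nodes:      %.0f/s\n", renewalsPerSecond(1000, interval))
	// Expiration-based: one lease renewal per range.
	fmt.Printf("10,000 ranges:    %.0f/s\n", renewalsPerSecond(10000, interval))
	fmt.Printf("3.3M ranges:      %.0f/s\n", renewalsPerSecond(3.3e6, interval))
}
```

Under that assumption, 1,000 nodes come to roughly 139 writes per second, 10,000 ranges to roughly 1,389, and 3.3 million ranges to roughly 458,000, which is the gap the epoch-based design is meant to close.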
We still require the traditional expiration-based range leases for any
ranges located at or before the liveness table's range. This might be
problematic in the case of meta2 address record ranges, which are
expected to proliferate in a large cluster. This lease traffic could
Would be nice to put a number on how many ranges couldn't use epoch-based range leases. How many meta2 ranges will there be for the 3.3 million range cluster example I mention? I'm not sure how big RangeDescriptors currently are (we should measure this), but let's assume 1KB per range of meta2 data. A single meta2 range could hold 64K descriptors, which would translate into 50 meta2 ranges. That doesn't seem worth optimizing.
Update: just checked on a local cluster and the size of a meta2 key+value averaged 184 bytes. That would translate into 364K ranges per meta2 range and we'd need 10 meta2 ranges to support 3.3 million ranges. You should probably double check my math here.
Well you end up with some historical versions and other dross, so I agree somewhere between 10 and 50 meta2 ranges. Doesn't seem worth the effort. I added a note.
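As a quick check of the math in this thread, the Go sketch below recomputes both estimates. It assumes a 64 MiB range size (implied by "64K" entries at 1KB each, but not stated outright) and ignores historical versions; `metaRangesNeeded` is a hypothetical helper.

```go
package main

import "fmt"

// metaRangesNeeded returns how many meta2 ranges are required to address
// totalRanges, given the range size and the bytes used per meta2 entry.
func metaRangesNeeded(totalRanges, rangeSizeBytes, bytesPerEntry int64) int64 {
	entriesPerRange := rangeSizeBytes / bytesPerEntry
	return (totalRanges + entriesPerRange - 1) / entriesPerRange // ceiling division
}

func main() {
	const rangeSize = 64 << 20 // assumed 64 MiB range size
	const total = 3300000      // 3.3 million ranges

	fmt.Println(metaRangesNeeded(total, rangeSize, 1024)) // 51
	fmt.Println(metaRangesNeeded(total, rangeSize, 184))  // 10
}
```

With 184-byte entries a meta2 range holds 364,722 descriptors, so 10 ranges suffice; at 1KB per entry the count rounds up to 51, in line with the "about 50" estimate above.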
## Use of distributed txn for updating liveness records

The original proposal mentioned: "The range lease record is always
updated in a distributed transaction with the -node lease record to
s/-node lease/node-lease/g
Changed to node liveness table
The original proposal mentioned: "The range lease record is always
updated in a distributed transaction with the -node lease record to
ensure that the epoch counter is consistent and -the start time is
s/-the/the/g
Done
TODO(peter): What is the motivation for gossipping the node lease
table? Gossipping means the node's will have out of date info for the
TODO(peter): What is the motivation for gossiping the node lease
table? Gossiping means the node's will have out of date info for the
Mind addressing this TODO? @bdarnell gave justification elsewhere.
Done.
// expiration-based range leases instead of the more efficient
// node-liveness epoch-based range leases (see
// https://github.com/cockroachdb/cockroach/blob/develop/docs/RFCS/range_leases.md)
NodeLivenessPrefix = roachpb.Key(makeKey(SystemPrefix, roachpb.RKey("\x00liveness-")))
Given that we almost never see raw keys anymore, is there a reason for the key to be so long? Probably doesn't matter (as there is only one liveness key per node), but it grates me a little.
It's only as long as the others.
fafaa3c to 97ad64b
record to increase the expiration timestamp and ensure that the epoch
has not changed. If the epoch does change, *all* of the range leases
held by this node are revoked. A node *must not* propose any commands
with a timestamp greater than its expiration timestamp modulo the
s/modulo/minus/
Done
Each node periodically performs a conditional put to its node liveness
record to increase the expiration timestamp and ensure that the epoch
has not changed. If the epoch does change, *all* of the range leases
held by this node are revoked. A node *must not* propose any commands
Add "or serve any reads" at the end of this line.
Done
@@ -152,6 +152,7 @@ message ChangeReplicasTrigger {
// This can be used to trigger scan-and-gossip for the given span.
message ModifiedSpanTrigger {
optional bool system_config_span = 1 [(gogoproto.nullable) = false];
optional Span node_liveness_span = 2;
OK, add a comment here about the difference.
// than a threshold liveness duration.
int64 epoch = 2;
// The timestamp at which this liveness record expires.
util.hlc.Timestamp expiration = 3 [(gogoproto.nullable) = false];
Store both the expiration and the stasis time (or the max offset), to make it possible to change the max offset safely.
I noticed this same pattern in the range lease proto. It seems like a bandaid on a gunshot wound. The reality is you simply will not be able to jigger max offsets without a stop-the-world cluster restart and a delay of at least the maximum lease expiration where max offset might matter. Just because the writer of this record and the reader(s) agree on the stasis timestamp doesn't protect from the inconsistencies which could arise if max offset were changed. To see why, consider the following scenario:
- Node A has a max offset = 10ns, accurate clock
- Node A sets liveness record expiration = @100ns, stasis = @90ns
- Node B, with max offset = 20ns and a fast clock by 20ns, bumps Node A's epoch @101ns (which is @81ns, according to Node A's clock)
- Node A & B can both believe they have leases
I think this "stasis" timestamp concept should be abolished in light of it being confusing, cognitively complex, and ultimately nothing but a Potemkin protection. We should be writing code with the assumption that all nodes agree on max offset. I don't believe we can do otherwise without kidding ourselves about guarantees.
I agree that we should avoid relying on the stasis timestamp as much as we can. On current `master`, it's necessary to use it and it was deemed more explicit to have it in the protos than pulling it from configuration, where it's more obvious that it does matter. I agree that shortening the MaxOffset is a can of worms we don't want to open now. Instead I think we should improve the detection of an obvious misconfiguration by storing the stasis timestamp and letting nodes verify that it agrees with what they believe the MaxOffset to be.
I'm just starting to look at this change, so excuse any lack of global context.
Actually, my analysis is wrong as long as both Node A and Node B use their respective clock offsets in relation to the stasis time (i.e. Node B should consider expiration to be @110ns). But that means we only need one timestamp. Node A stops serving reads at `Expiration`; Node A only considers a lease or liveness record expired at `Expiration + MaxOffset`.
Ah, actually I think we do need to send the `MaxOffset` because the case I just mentioned only holds if Node B's `MaxOffset` is greater than or equal to Node A's. What we need is Node A to stop serving reads at `Expiration` and Node B to consider a lease or liveness record expired only after `Expiration + Max(MaxOffset[Node B], MaxOffset[Node A])`.
But again, this just raises the question of whether we ought to support changing max offset without requiring a cluster freeze. Is this kind of complexity healthy when you weigh costs and benefits?
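The rule proposed in the last comment can be written down as a small Go sketch. The function names are hypothetical, timestamps are plain nanosecond counts as in the scenario above, and this is one reading of the comment rather than what the PR implements.

```go
package main

import "fmt"

// ownerMayServe reports whether the record's owner may still serve reads:
// it must stop at Expiration.
func ownerMayServe(now, expiration int64) bool {
	return now < expiration
}

// considersExpired reports whether an observer whose clock reads now may
// treat the record as expired. It waits out the larger of the two max
// offsets, i.e. Expiration + Max(MaxOffset[observer], MaxOffset[owner]).
func considersExpired(now, expiration, observerMaxOffset, ownerMaxOffset int64) bool {
	offset := observerMaxOffset
	if ownerMaxOffset > offset {
		offset = ownerMaxOffset
	}
	return now > expiration+offset
}

func main() {
	// Scenario from the thread: Node A (max offset 10ns) writes
	// expiration @100ns; Node B (max offset 20ns) reads it @101ns.
	fmt.Println(considersExpired(101, 100, 20, 10)) // false
	fmt.Println(considersExpired(121, 100, 20, 10)) // true
	fmt.Println(ownerMayServe(99, 100))             // true
}
```

With this rule Node B cannot bump Node A's epoch at @101ns, which is the step that allowed both nodes to believe they held leases in the earlier scenario.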
// computed from the specified tick interval and the default number of
// election timeout ticks.
func RaftElectionTimeout(raftTickInterval time.Duration) time.Duration {
return time.Duration(defaultRaftElectionTimeoutTicks) * raftTickInterval
RaftElectionTimeoutTicks can be changed; we shouldn't hard-code the default here.
OK.
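One way to address the point above is to pass the configured tick count in rather than hard-coding the default. The Go sketch below is an assumption about how that could look: the extra parameter, the fallback behavior, and the default of 15 ticks are all illustrative, not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// defaultRaftElectionTimeoutTicks is an assumed default, used here only
// when the caller has not configured a tick count.
const defaultRaftElectionTimeoutTicks = 15

// exampleTick is an illustrative raft tick interval.
const exampleTick = 200 * time.Millisecond

// RaftElectionTimeout returns the election timeout for the given tick
// interval and configured number of election timeout ticks. A non-positive
// tick count falls back to the default.
func RaftElectionTimeout(
	raftTickInterval time.Duration, raftElectionTimeoutTicks int,
) time.Duration {
	if raftElectionTimeoutTicks <= 0 {
		raftElectionTimeoutTicks = defaultRaftElectionTimeoutTicks
	}
	return time.Duration(raftElectionTimeoutTicks) * raftTickInterval
}

func main() {
	fmt.Println(RaftElectionTimeout(exampleTick, 0))  // 3s
	fmt.Println(RaftElectionTimeout(exampleTick, 10)) // 2s
}
```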
97ad64b to e8dd5da
Reviewed 1 of 9 files at r5. docs/RFCS/range_leases.md, line 107 at r3 (raw file):
|
Review status: 1 of 36 files reviewed at latest revision, 44 unresolved discussions, some commit checks failed. docs/RFCS/range_leases.md, line 107 at r3 (raw file):
|
Review status: 1 of 36 files reviewed at latest revision, 43 unresolved discussions, some commit checks failed. docs/RFCS/range_leases.md, line 66 at r5 (raw file):
|
e8dd5da to bb05979
Reviewed 16 of 34 files at r1, 2 of 6 files at r2, 9 of 17 files at r3, 8 of 9 files at r5, 3 of 3 files at r6. docs/RFCS/range_leases.md, line 61 at r5 (raw file):
Heartbeats are not mentioned here nor explained anywhere else though they are referenced and implemented. Please clarify throughout. docs/RFCS/range_leases.md, line 66 at r5 (raw file):
|
bb05979 to 1ee4a91
Review status: all files reviewed at latest revision, 58 unresolved discussions, some commit checks failed. docs/RFCS/range_leases.md, line 61 at r5 (raw file):
|
Reviewed 16 of 34 files at r1, 2 of 6 files at r2, 4 of 17 files at r3, 6 of 9 files at r5, 8 of 8 files at r7. server/server.go, line 210 at r7 (raw file):
This needs to use storage/node_liveness.go, line 75 at r2 (raw file):
|
Reviewed 8 of 8 files at r7. docs/RFCS/range_leases.md, line 66 at r5 (raw file):
|
Review status: all files reviewed at latest revision, 41 unresolved discussions, all commit checks successful. storage/replica.go, line 3404 at r2 (raw file):
|
circular dependencies. This table maps node IDs to an epoch counter,
and an expiration timestamp.

## Liveness table
Node liveness table
liveness updates will simply resort to a conditional put to increment
a seemingly not-live node's liveness epoch. The conditional put will
fail because the expected value is out of date and the correct liveness
info is returned to the caller.
This paragraph is better placed further down. Perhaps after the next two paragraphs
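The behavior the quoted RFC paragraph describes, a conditional put that fails on a stale expected value and hands back the current record, can be sketched in Go as follows. `livenessStore`, `conditionalPut`, and `incrementEpoch` are hypothetical in-memory stand-ins, not the KV API; the check that the target's record has actually expired is omitted, and the sketch is single-threaded.

```go
package main

import (
	"errors"
	"fmt"
)

// Liveness is a simplified liveness record with integer timestamps.
type Liveness struct {
	Epoch      int64
	Expiration int64
}

// livenessStore is a hypothetical in-memory stand-in for the liveness keys.
type livenessStore struct {
	records map[int]Liveness
}

var errConditionFailed = errors.New("unexpected value")

// conditionalPut writes next only if the stored record equals expected.
// On mismatch it returns the actual record so the caller can refresh.
func (s *livenessStore) conditionalPut(nodeID int, expected, next Liveness) (Liveness, error) {
	actual := s.records[nodeID]
	if actual != expected {
		return actual, errConditionFailed
	}
	s.records[nodeID] = next
	return next, nil
}

// incrementEpoch bumps nodeID's epoch based on the caller's view of it.
func (s *livenessStore) incrementEpoch(nodeID int, seen Liveness) (Liveness, error) {
	next := seen
	next.Epoch++
	return s.conditionalPut(nodeID, seen, next)
}

func main() {
	s := &livenessStore{records: map[int]Liveness{1: {Epoch: 1, Expiration: 100}}}

	// A caller with an out-of-date view fails and learns the real record.
	actual, err := s.incrementEpoch(1, Liveness{Epoch: 1, Expiration: 90})
	fmt.Println(actual, err)
}
```

A node that missed gossip updates therefore cannot corrupt the record; the worst case is one failed write that refreshes its view.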
epoch. If a node is down (and its node lease has expired), another
node may revoke its lease(s) by incrementing the node lease
epoch. If a node is down (and its node liveness has expired), another
node may revoke its lease(s) by incrementing the node liveness
epoch. Once this is done the old range lease is invalidated and a new
node may claim the range lease.
I think it's nice to be explicit here about the disjointness invariant. A range lease can move from node1 to node2 only after node1's liveness record has expired and node2 has a valid, unexpired liveness epoch.
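Stated as code, the invariant in this comment might look like the Go sketch below. The names are hypothetical, timestamps are plain integers, and adding the max clock offset before treating node1's record as expired is an assumption carried over from the RFC text quoted elsewhere in this review.

```go
package main

import "fmt"

// Liveness is a simplified liveness record with integer timestamps.
type Liveness struct {
	Epoch      int64
	Expiration int64
}

// canMoveLease reports whether a range lease may move from oldHolder to
// newHolder at time now: the old holder's record must have expired
// (allowing for maxOffset) and the new holder's record must be unexpired.
func canMoveLease(oldHolder, newHolder Liveness, now, maxOffset int64) bool {
	oldExpired := now > oldHolder.Expiration+maxOffset
	newLive := now < newHolder.Expiration
	return oldExpired && newLive
}

func main() {
	node1 := Liveness{Epoch: 1, Expiration: 100}
	node2 := Liveness{Epoch: 4, Expiration: 200}
	fmt.Println(canMoveLease(node1, node2, 105, 10)) // false: node1 may still be live
	fmt.Println(canMoveLease(node1, node2, 111, 10)) // true
	fmt.Println(canMoveLease(node1, node2, 250, 10)) // false: node2 has expired
}
```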
epoch. If a node is down (and its node lease has expired), another
node may revoke its lease(s) by incrementing the node lease
epoch. If a node is down (and its node liveness has expired), another
node may revoke its lease(s) by incrementing the node liveness
which node liveness?
Review status: all files reviewed at latest revision, 45 unresolved discussions, all commit checks successful. docs/RFCS/range_leases.md, line 66 at r5 (raw file):
|
6eb0dc1 to 4b42c26
Reviewed 13 of 13 files at r8. docs/RFCS/range_leases.md, line 66 at r5 (raw file):
|
Assigned |
Review status: all files reviewed at latest revision, 44 unresolved discussions, some commit checks failed. docs/RFCS/range_leases.md, line 66 at r5 (raw file):
|
@tschottdorf sorry for responding with no push. Damn plane wifi won't let me. Will push in CA. |
4b42c26 to ce9806a
@tschottdorf ptal |
6837ae6 to 1f36e2f
@tschottdorf ping |
1f36e2f to 64b45f4
Reviewed 2 of 13 files at r8, 28 of 28 files at r9. Comments from Reviewable |
@cockroachdb/stability, @petermattis I'd like to merge this PR. Speak now or forever hold your peace. |
64b45f4 to ab5882d
and sorry for letting this sit. Note that this is still a Reviewed 17 of 28 files at r9, 11 of 11 files at r10. docs/RFCS/range_leases.md, line 87 at r10 (raw file):
👍 It might be worth including only a subset of data that identifies the lease uniquely instead of the full proto but let's talk about it when that change actually happens. server/server.go, line 289 at r10 (raw file):
Where's this coming from? storage/node_liveness_test.go, line 216 at r8 (raw file):
Why'd this go? Flaky? Why? storage/node_liveness_test.go, line 236 at r10 (raw file):
Can't the callback fire a couple of times, making this potentially flaky? storage/store.go, line 561 at r10 (raw file):
Ditto. Comments from Reviewable |
Review status: all files reviewed at latest revision, 37 unresolved discussions, some commit checks failed. docs/RFCS/range_leases.md, line 87 at r10 (raw file):
|
This change adds a node liveness table as a global system table. Nodes periodically write updates to their liveness record by doing a conditional put to the liveness table. The leader of the range containing the node liveness table gossips the latest information to the rest of the system. Each node has a `NodeLiveness` object which can be used to query the status of any other node to find out if it's live or non-live according to the liveness threshold duration compared to the last time it successfully heartbeat its liveness record. The as-yet-unused `IncrementEpoch` mechanism is also added in this PR, for eventual use with the planned epoch-based range leader leases. Updated the range leader lease RFC to reflect current thinking.
ab5882d to 9e38221
Eh? I guess that is one way to sequence risky PRs. On Wednesday, October 12, 2016, Spencer Kimball notifications@github.com
|
This is not a risky PR. But what is the process? On Wed, Oct 12, 2016 at 12:11 PM Peter Mattis notifications@github.com
|
The process was to wait for a thumbs up, merge, deploy the immediately preceding SHA and then deploy this PR. We were holding off merging any of these "stability" changes until this week's beta was baked as it contains all of the develop->master merge changes. Even if this wasn't clear because you're remote this week, @tschottdorf's comment should have alerted you:
|
Happy to revert it. On Wed, Oct 12, 2016 at 1:15 PM Peter Mattis notifications@github.com
|
I'd say leave it in. I don't think this is risky enough to need a supervised merge (the change to actually use node liveness for leases will be, but this part seems pretty safe). |
Reviewed 6 of 34 files at r1, 1 of 17 files at r3, 1 of 8 files at r7, 2 of 13 files at r8, 16 of 28 files at r9, 9 of 11 files at r10, 2 of 2 files at r11. docs/RFCS/range_leases.md, line 48 at r11 (raw file):
didn't we just update the design doc to avoid the word "table" in such contexts? keys/constants.go, line 197 at r11 (raw file):
this link is broken. there is no develop branch. server/context.go, line 117 at r11 (raw file):
It'd be helpful for this comment to explain how this field relates to storage/node_liveness.go, line 151 at r11 (raw file):
this can be in storage/node_liveness.go, line 157 at r11 (raw file):
ditto storage/node_liveness.go, line 313 at r11 (raw file):
errors.Wrapf storage/node_liveness.go, line 328 at r11 (raw file):
this is an inappropriate use of storage/node_liveness_test.go, line 43 at r11 (raw file):
seems like the error should take precedence; won't this always be true while err is non-nil? storage/node_liveness_test.go, line 62 at r11 (raw file):
shouldn't we be moving to testcluster? storage/node_liveness_test.go, line 76 at r11 (raw file):
ditto: shouldn't the error take precedence? storage/node_liveness_test.go, line 114 at r11 (raw file):
include the value of storage/node_liveness_test.go, line 142 at r11 (raw file):
%v because storage/node_liveness_test.go, line 166 at r11 (raw file):
seems like using a Comments from Reviewable |
Leave it in, though you did make a couple of tests flaky (i.e. |
@tamird I made most of your suggested changes (sending another PR). Comments from Reviewable |