kvserver: add assertions for invariants around liveness records #54544
Conversation
Force-pushed from 11da600 to e8f6137.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/cli/error.go, line 364 at r1 (raw file):
// Are we trying to recommission/decommision a node that does not exist?
if strings.Contains(err.Error(), kvserver.ErrMissingLivenessRecord.Error()) {
not the right place for this check.
Say you're running a distsql query, and for some catastrophic reason a liveness record goes missing which causes an assertion error in SQL. We definitely want to see this in the output of the CLI command!
What you want is a check for this error specifically in the node (sub-)commands.
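For illustration, a minimal sketch of the shape being suggested, with a hypothetical helper in the node subcommand and a stand-in sentinel (the real check would compare against kvserver.ErrMissingLivenessRecord inside pkg/cli/node.go, not this placeholder):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errMissingLivenessRecord stands in for kvserver.ErrMissingLivenessRecord;
// the real subcommand would compare against that sentinel instead.
var errMissingLivenessRecord = errors.New("missing liveness record")

// decorateDecommissionError is a hypothetical helper living in the node
// (sub-)commands: only decommission/recommission attempts against unknown
// nodes get the friendly hint, while unrelated errors (e.g. a distsql query
// tripping the assertion) still surface verbatim through the generic handler.
func decorateDecommissionError(err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), errMissingLivenessRecord.Error()) {
		return fmt.Errorf("node does not exist; double-check the node ID (%w)", err)
	}
	return err
}

func main() {
	wrapped := fmt.Errorf("decommission failed: %w", errMissingLivenessRecord)
	fmt.Println(decorateDecommissionError(wrapped))
}
```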
With the current upgrade semantics (i.e. no long-running migrations), I think the old code is still necessary. If a node joins at 20.1 and is down throughout 20.2 but comes back at 21.1 (as unlikely as this is, it would require either weird recommissioning or manual version bump) then it may not have a liveness record. In practice I think this is exceedingly unlikely (outside of contrived tests) and when it occurs, we can work past it.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/node_liveness.go, line 268 at r1 (raw file):
for r := retry.StartWithCtx(ctx, base.DefaultRetryOptions()); r.Next(); {
	oldLivenessRec, err := nl.SelfEx()
	if err != nil && !errors.Is(err, errLivenessRecordCacheMiss) {
See other comment further below about this pattern.
pkg/kv/kvserver/node_liveness.go, line 633 at r1 (raw file):
for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
	oldLiveness, err := nl.Self()
	if err != nil && !errors.Is(err, errLivenessRecordCacheMiss) {
If you get an error here, it seems like you're still going to heartbeatInternal with a zero liveness. That does not seem like what you want any more. Also, the error handling here seems clunky. Why not
oldLiveness, err := nl.Self()
if errors.Is(err, errLivenessRecordCacheMiss) {
oldLiveness, err = nl.getLivenessRecordFromKV(ctx, nodeID)
}
if err != nil {
log.Infof(ctx, "unable to get liveness record: %s", err)
continue
}
// ...
There's a similar pattern higher up in the review that should also get a second look.
Hm, that sucks. I wonder if the thing we want here for iron-clad guarantees is an RPC-level check gating the rando node from joining the cluster (separately it'd be a good idea to disallow RPCs from nodes with no liveness records in KV, cause that should never happen anyway going forward). If we had that kind of check which prevented this previously zombie node from re-joining the cluster, would that be acceptable? I'm also happy to just live with our earlier code, was just trying to sand down some sharp edges here motivated by nothing in particular (which evidently brings more of its own).
If you haven't had a node come live at all in 20.2 and are expecting it to run in 21.1, I think it's fine if that expectation is occasionally not going to pan out (I'm pretty sure this is not something we even support doing). Let's go ahead with this PR.
Force-pushed from 2e78dcc to dea687f.
PTAL, I also re-worked the NodeLiveness.getLivenessLocked method signature (and propagated it through the callers):
-func (nl *NodeLiveness) getLivenessLocked(nodeID roachpb.NodeID) (LivenessRecord, error)
+func (nl *NodeLiveness) getLivenessLocked(nodeID roachpb.NodeID) (_ LivenessRecord, ok bool)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/cli/error.go, line 364 at r1 (raw file):
Previously, knz (kena) wrote…
not the right place for this check.
Say you're running a distsql query, and for some catastrophic reason a liveness record goes missing which causes an assertion error in SQL. We definitely want to see this in the output of the CLI command! What you want is a check for this error specifically in the node (sub-)commands.
Done.
pkg/kv/kvserver/node_liveness.go, line 268 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
See other comment further below about this pattern.
Done.
pkg/kv/kvserver/node_liveness.go, line 633 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If you get an error here, it seems like you're still going to heartbeatInternal with a zero liveness. That does not seem like what you want any more. Also, the error handling here seems clunky. Why not
oldLiveness, err := nl.Self()
if errors.Is(err, errLivenessRecordCacheMiss) {
	oldLiveness, err = nl.getLivenessRecordFromKV(ctx, nodeID)
}
if err != nil {
	log.Infof(ctx, "unable to get liveness record: %s", err)
	continue
}
// ...
There's a similar pattern higher up in the review that should also get a second look.
Done.
Force-pushed from dea687f to d66a0f0.
Ugh, the more I look at this code, the more questions I have about it (not your fault - the code was junk and you're merely alerting me to it).
However, since you're introducing a way to read from KV directly, I think we need to get that part right. Something still seems convoluted there, as you can see from my partially confused comments. It looks like the main method, GetLiveness(), does not fetch from KV; it just consults the cache. That is unexpected, and as a result there's a lot of confusion about when we can assert on a record being returned from methods. You would think that we would default to populating the cache from KV and that not doing so amounts to "peeling back" the abstraction because you are an advanced internal user. I think by exploring this approach we'll significantly simplify the liveness code, because the cache updating and db fetching will live in one code path that provides the foundation for the higher-level functions of the pkg (membership status, etc).
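As an editor's sketch of the look-through shape being suggested here (Record, cache, and fetchFromKV are placeholders, not the actual NodeLiveness API):

```go
package liveness

import (
	"context"
	"sync"
)

// Record and the fetch function are stand-ins for the real LivenessRecord
// type and for something like getLivenessRecordFromKV; only the shape of the
// look-through cache is the point here.
type Record struct {
	NodeID int32
	Epoch  int64
}

type cache struct {
	mu          sync.Mutex
	nodes       map[int32]Record
	fetchFromKV func(context.Context, int32) (Record, error)
}

func newCache(fetch func(context.Context, int32) (Record, error)) *cache {
	return &cache{nodes: map[int32]Record{}, fetchFromKV: fetch}
}

// GetLiveness consults the in-memory map first and, on a miss, falls through
// to KV and repopulates the cache, so callers see a single code path that
// both updates the cache and fetches from the db.
func (c *cache) GetLiveness(ctx context.Context, nodeID int32) (Record, error) {
	c.mu.Lock()
	if rec, ok := c.nodes[nodeID]; ok {
		c.mu.Unlock()
		return rec, nil
	}
	c.mu.Unlock() // never hold the mutex across the KV request

	rec, err := c.fetchFromKV(ctx, nodeID)
	if err != nil {
		return Record{}, err
	}

	c.mu.Lock()
	c.nodes[nodeID] = rec
	c.mu.Unlock()
	return rec, nil
}
```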
Btw, the signature changes look good, though they should be pulled out - this diff is already tricky to review because the code isn't in the best place.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
pkg/cli/node.go, line 344 at r2 (raw file):
if err := runDecommissionNodeImpl(ctx, c, nodeCtx.nodeDecommissionWait, nodeIDs); err != nil {
	// Are we trying to recommission/decommision a node that does not exist?
	if strings.Contains(err.Error(), kvserver.ErrMissingLivenessRecord.Error()) {
We need the same in recommission, right?
Also, wouldn't it be better if we use a grpc error code here:
cockroach/pkg/server/server.go, lines 1966 to 1969 in 532870f:
statusChanged, err := s.nodeLiveness.SetMembershipStatus(ctx, nodeID, targetStatus)
if err != nil {
	return err
}
if errors.Is(err, ErrMissingLivenessRecord)
and then on the client tailor the message based on the error code?
That would seem more idiomatic: avoids the string matching, cli->kvserver dependency (which goes across possibly multiple versions).
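A sketch of that idiom using the standard gRPC status package; the sentinel and helper names here are stand-ins, not the actual server or CLI code:

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errMissingLivenessRecord stands in for the kvserver sentinel.
var errMissingLivenessRecord = errors.New("missing liveness record")

// Server side: translate the sentinel into a gRPC status code before it
// crosses the wire, e.g. in the Decommission RPC handler.
func maybeWrapForRPC(err error) error {
	if errors.Is(err, errMissingLivenessRecord) {
		return status.Error(codes.NotFound, "node does not exist")
	}
	return err
}

// Client side: tailor the CLI message based on the code, with no string
// matching and no cli -> kvserver dependency.
func decorateCLIError(err error) error {
	if status.Code(err) == codes.NotFound {
		return fmt.Errorf("cannot decommission: %s", status.Convert(err).Message())
	}
	return err
}

func main() {
	err := maybeWrapForRPC(fmt.Errorf("server: %w", errMissingLivenessRecord))
	fmt.Println(decorateCLIError(err))
}
```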
pkg/jobs/registry.go, line 790 at r2 (raw file):
if !ok {
	if nodeLivenessLogLimiter.ShouldLog() {
		log.Warning(ctx, "unable to get node liveness")
"own liveness record not found"?
pkg/kv/kvserver/node_liveness.go, line 274 at r2 (raw file):
livenessRec, err := nl.getLivenessRecordFromKV(ctx, nodeID)
if err != nil {
	log.Infof(ctx, "unable to get liveness record: %s", err)
add "from KV"
pkg/kv/kvserver/node_liveness.go, line 396 at r2 (raw file):
if oldLivenessRec.Liveness == (kvserverpb.Liveness{}) {
	log.Fatal(ctx, "invalid old liveness record; found to be empty")
Return an error here and everywhere else we assert on this stuff. I was thinking about the assertions a bit more and it seems that in the unlikely case in which they fire, they would not necessarily fire on the node that actually has the missing record (i.e. the one that missed a release). We definitely want to avoid that.
pkg/kv/kvserver/node_liveness.go, line 516 at r2 (raw file):
) (statusChanged bool, err error) {
	if oldLivenessRec.Liveness == (kvserverpb.Liveness{}) {
		log.Fatal(ctx, "invalid old liveness record; found to be empty")
return err
pkg/kv/kvserver/node_liveness.go, line 575 at r2 (raw file):
if !ok {
	// TODO(irfansharif): Is this conditional possible given we're always
	// persisting the liveness record?
It should not be, right? `GetLiveness` hits the db when it needs to?
It looks to me as though you should be returning an assertion failed error here (`errors.AssertionFailedf`, just like for the other assertions).
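For reference, a minimal sketch of that shape using errors.AssertionFailedf from github.com/cockroachdb/errors; the surrounding type is a stand-in for the kvserver one:

```go
package liveness

import "github.com/cockroachdb/errors"

// Liveness is a stand-in for kvserverpb.Liveness; the point is the shape of
// returning an assertion-failure error instead of log.Fatal-ing, so that the
// failure surfaces to the caller rather than crashing whichever node happened
// to observe the empty record.
type Liveness struct {
	NodeID int32
	Epoch  int64
}

func checkOldLiveness(old Liveness) error {
	if old == (Liveness{}) {
		return errors.AssertionFailedf("invalid old liveness record; found to be empty")
	}
	return nil
}
```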
pkg/kv/kvserver/node_liveness.go, line 954 at r2 (raw file):
func (nl *NodeLiveness) getLivenessLocked(nodeID roachpb.NodeID) (_ LivenessRecord, ok bool) {
	if l, ok := nl.mu.nodes[nodeID]; ok {
		return l, true
Oh, huh, I guess it doesn't get it from KV. That's because you want to avoid circular dependencies, right? Like checking the liveness to get lease on the liveness table leading to reading from the liveness table?
Something here is still wonky. You'd expect going through KV to be the norm, and that only at the range level (where you worry about the circularity) we consult the cache directly.
The "look-aside cache" structure here was intentional, though I could've communicated that better. My thinking was that I wanted a more easily verifiable progression between where we are today (which is this confusing look-aside cache structure) and where we want to be (a look-through cache, probably in future refactoring). Let me try to re-explain. The intended "diff" this PR introduces is probably best exemplified by the following snippet: @@ -631,9 +625,15 @@ func (nl *NodeLiveness) StartHeartbeat(
func(ctx context.Context) error {
// Retry heartbeat in the event the conditional put fails.
for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
- oldLiveness, err := nl.Self()
- if err != nil && !errors.Is(err, ErrNoLivenessRecord) {
- log.Errorf(ctx, "unexpected error getting liveness: %+v", err)
+ oldLiveness, ok := nl.Self()
+ if !ok {
+ nodeID := nl.gossip.NodeID.Get()
+ liveness, err := nl.getLivenessFromKV(ctx, nodeID)
+ if err != nil {
+ log.Infof(ctx, "unable to get liveness record: %s", err)
+ continue
+ }
+ oldLiveness = liveness
}
if err := nl.heartbeatInternal(ctx, oldLiveness, incrementEpoch); err != nil {
if errors.Is(err, ErrEpochIncremented) { Previously we did this funky thing where we consulted our look-aside cache ( I'll try separating out the signature diff and making the error changes, plus add more clarifying commentary for what future progression for this NodeLiveness guy should look like (tl;dr use a look-through cache instead of our current look-aside one). |
Hm, I don't think I quite understand the distinction you're drawing between simply Fatal-ing away vs. returning assertion failure errors. Looking at our existing code, we have these "blind retry on error" loops today, like below:
cockroach/pkg/kv/kvserver/node_liveness.go, lines 267 to 271 in b6aa84a
These would just swallow any assertion failure errors we'd want to return. Given these are assertion failures, I also don't know that we should worry about where they're firing, just that they are firing. I think it also simplifies the code a bit to simply Fatal away instead of returning a possibly swallowed error.
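A sketch of the kind of loop being referred to (shape illustrative, not the actual NodeLiveness code), showing how a returned assertion-failure error would simply be swallowed:

```go
package liveness

import (
	"context"
	"log"
)

// blindRetry mimics the "retry on any error" loops referenced above: any
// error coming out of the body, including one built with
// errors.AssertionFailedf, is logged and then retried, so a returned
// assertion failure would never reach a caller that could report it.
func blindRetry(ctx context.Context, body func(context.Context) error) {
	for ctx.Err() == nil {
		if err := body(ctx); err != nil {
			log.Printf("retrying after error: %v", err)
			continue
		}
		return
	}
}
```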
Release note: None
Force-pushed from d66a0f0 to 17b74d5.
Done.
I completely agree. My primary intent with this PR was to simply add these assertions in, to try and prove to ourselves that liveness records do, in fact, always exist. I'm reasonably comfortable with that now that this goes through CI just fine, and the few test clusters I've spun up have yet to fall over. So for now I'm a bit inclined to live with this "peeled back" abstraction for a while. I think it's no worse than our previous state of things (and hopefully a bit better, having removed the empty liveness handling code).
Thank you for reworking the commit history! This was great to review (despite reviewable being a little confused at this point). I think we're on home stretch here. The only comment worth highlighting is that we must not hold the mutex across the KV request.
Reviewed 14 of 14 files at r3, 3 of 3 files at r4, 2 of 2 files at r5, 4 of 4 files at r6, 1 of 1 files at r7, 2 of 2 files at r8, 11 of 11 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/kv/kvserver/node_liveness.go, line 278 at r4 (raw file):
// best effort attempt at marking the node as draining. It can return
// without an error after trying unsuccessfully a few times. Is that what we
// want?
No, this seems to be another one of these cases where the retry loop bailing out is not correctly terminated. Please fix!
Something like
if err := ctx.Err(); err != nil {
	return err
}
return errors.New("failed")
should do the trick (the final return should be unreachable, hopefully)
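Roughly, the termination being asked for looks like the following sketch, with a hypothetical attempt() step standing in for the body of the real retry.StartWithCtx loop:

```go
package liveness

import (
	"context"
	"errors"
)

// setDrainingSketch illustrates the requested termination: if we fall out of
// the retry loop without succeeding, we must not return nil silently; we
// either surface ctx.Err() or an explicit failure.
func setDrainingSketch(ctx context.Context, attempts int, attempt func(context.Context) error) error {
	for i := 0; i < attempts; i++ {
		if err := attempt(ctx); err != nil {
			continue // e.g. retry a failed conditional put
		}
		return nil // drained successfully
	}
	if err := ctx.Err(); err != nil {
		return err
	}
	return errors.New("failed to mark node as draining")
}
```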
pkg/kv/kvserver/node_liveness.go, line 1225 at r6 (raw file):
var shouldReplace bool
nl.mu.Lock()
oldLivenessRec, err := nl.getLivenessLocked(newLivenessRec.NodeID)
I don't think we should be performing KV operations under `nl.mu`. Please make sure we never do that.
As a consequence, I think we should be going through `shouldReplaceLiveness` in every case (i.e. not set it to true here).
pkg/kv/kvserver/node_liveness.go, line 611 at r9 (raw file):
liveness, ok := nl.GetLiveness(nodeID)
if !ok {
	// TODO(irfansharif): Is this conditional possible given we're always
At the very least, you'd hit it for looking up nodes that never existed.
You're really saying that you don't expect any callers to put in a nodeID here that they haven't learned from some liveness record (which sort-of implies that they got the liveness from our cache, etc)?
And then maybe you're saying this should return the "not found" error instead of the cache miss?
Clarify the TODO, please.
pkg/kv/kvserver/node_liveness.go, line 976 at r9 (raw file):
// GetLiveness returns the liveness record for the specified nodeID. If the
// liveness record is not found (due to gossip propagation delays), we surface
(or because the node does not exist)
pkg/kv/kvserver/node_liveness.go, line 988 at r9 (raw file):
// getLivenessLocked returns the liveness record for the specified nodeID,
// consulting the in-memory cache. If nothing is found (could happen due to
// gossip propagation delays), we surface that to the caller.
(or because the node does not exist)
pkg/kv/kvserver/node_liveness.go, line 1184 at r9 (raw file):
l, ok := nl.GetLiveness(update.newLiveness.NodeID)
if !ok {
	// TODO(irfansharif): Is this conditional possible given we're
Maybe link this to the other TODO since you're going to write more prose there? Here it's even less likely that we'll hit this path, maybe if we tried to decom a nonexisting node? But we'd bail out higher up.
pkg/kv/kvserver/node_liveness_test.go, line 799 at r4 (raw file):
}
if err := mtc.nodeLivenesses[drainingNodeIdx].SetDraining(ctx, true, nil); err != nil {
Why'd you remove the inline comments? According to our style guide, they should remain.
We'll need it in a future commit. Not returning an error masked an existing bug (which we leave unfixed for now) where it was possible for us to be unable to mark the target node as draining and still proceed. The usage of SetDraining does not assume it to be a best-effort attempt.

Release note: None
In an earlier refactor we observed that it was possible to return successfully from a node drain attempt despite not having fully drained the target node. This made the `cockroach node drain` invocation a best-effort attempt, not a guaranteed one, which was not the intent. We return the appropriate failure now.

Release note: None
In a future commit we'll introduce assertions that this is no longer possible. Let's update our tests to be more representative of inputs we'd normally expect. Release note: None
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we change them to assertions that verify that empty liveness records are no longer flying around in the system.

When retrieving the liveness record from our in-memory cache, it was possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and eventually update our caches to store this newly read record). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`.

We should note that the existing cache structure within NodeLiveness is a look-aside cache, and that's not changed. It would further simplify things if it was a look-through cache where the update happened while fetching any record and failing to find it, but we defer that to future work. A TODO outlining this will be introduced in a future commit.

A note for ease of review: one structural change introduced in this diff is breaking down `ErrNoLivenessRecord` into `ErrMissingLivenessRecord` and `errLivenessRecordCacheMiss`. The former will be used in a future commit to generate better hints for users (it'll only ever surface when attempting to decommission/recommission non-existent nodes). The latter is used to represent cache misses. This too will be improved in a future commit, where instead of returning a specific error on cache access, we'll return a boolean instead.

---

We don't intend to backport this to 20.2 due to the hazard described in cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early.

Release note: None
This TODO hopefully adds some clarity to the current code structure and suggests what it should look like going forward. Release note: None
The only comment worth highlighting is that we must not hold the mutex across the KV request.
Clarified below, but we aren't. Addressed the remaining comments, thanks for reviewing!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)
pkg/kv/kvserver/node_liveness.go, line 278 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
No, this seems to be another one of these cases where the retry loop bailing out is not correctly terminated. Please fix!
Something like
if err := ctx.Err(); err != nil {
	return err
}
return errors.New("failed")
should do the trick (the final return should be unreachable hopefully)
Done.
pkg/kv/kvserver/node_liveness.go, line 1225 at r6 (raw file):
I don't think we should be performing KV operations under `nl.mu`.
We're not, heh. This is getLivenessLocked, not getLivenessRecordFromKV. The diff here is functionally the same as before (checking the `nodes` map), except the empty check was pulled a level above (just to add the assertion I did in shouldReplaceLiveness).
pkg/kv/kvserver/node_liveness.go, line 611 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
At the very least, you'd hit it for looking up nodes that never existed.
You're really saying that you don't expect any callers to put in a nodeID here that they haven't learned from some liveness record (which sort-of implies that they got the liveness from our cache, etc)?
And then maybe you're saying this should return the "not found" error instead of the cache miss?
Clarify the TODO, please.
Done.
pkg/kv/kvserver/node_liveness.go, line 976 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
(or because the node does not exist)
Done.
pkg/kv/kvserver/node_liveness.go, line 988 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
(or because not does not exist)
Done.
pkg/kv/kvserver/node_liveness.go, line 1184 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe link this to the other TODO since you're going to write more prose there? Here it's even less likely that we'll hit this path, maybe if we tried to decom a nonexisting node? But we'd bail out higher up.
Done. And yes, even less likely.
pkg/kv/kvserver/node_liveness_test.go, line 799 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why'd you remove the inline comments? According to our style guide, they should remain.
heh, was just trying to limit the line width (and Goland shows what these args are); reverted.
Force-pushed from 17b74d5 to 5a1d2e2.
Specifically around {d,r}ecommissioning non-existent nodes. We're able to do this now because we can always assume that a missing liveness record, as seen in the decommission/recommission codepaths, implies that the user is trying to decommission/recommission a non-existent node.

Release note: None
Force-pushed from 5a1d2e2 to a87c86e.
Reviewed 1 of 14 files at r2, 2 of 4 files at r13, 2 of 2 files at r15, 11 of 11 files at r16.
Dismissed @knz from a discussion.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/kv/kvserver/node_liveness.go, line 1225 at r6 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I don't think we should be performing KV operations under nl.mu
We're not, heh. This is getLivenessLocked, not getLivenessRecordFromKV. The diff here is functionally the same as before (checking the `nodes` map), except the empty check was pulled a level above (just to add the assertion I did in shouldReplaceLiveness).
🤦 thanks
pkg/kv/kvserver/node_liveness.go, line 672 at r16 (raw file):
liveness, err := nl.getLivenessFromKV(ctx, nodeID)
if err != nil {
	log.Infof(ctx, "unable to get liveness record: %s", err)
add "from KV"?
pkg/kv/kvserver/replica_range_lease.go, line 232 at r16 (raw file):
Existing: status.Lease,
Requested: reqLease,
Message: fmt.Sprintf("couldn't request lease for %+v: %v", nextLeaseHolder, errLivenessRecordCacheMiss),
Weird to reach out to the liveness-internal error here. But I guess it works.
pkg/kv/kvserver/replica_range_lease.go, line 556 at r16 (raw file):
ctx = r.AnnotateCtx(ctx)
log.Warningf(ctx, "can't determine lease status of %s due to node liveness error: %+v",
	lease.Replica, errLivenessRecordCacheMiss)
Ditto
Specifically, apply the following diff (and propagate through callers):

```diff
- getLivenessLocked(roachpb.NodeID) (LivenessRecord, error)
+ getLivenessLocked(roachpb.NodeID) (_ LivenessRecord, ok bool)
```

Previously there was only one possible error type returned, so this is just a drive-by simplification.

Release note: None
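At a call site, the change reads roughly like the following sketch (stand-in types; the real method takes a roachpb.NodeID and returns a LivenessRecord):

```go
package liveness

// LivenessRecord is a stand-in for the real kvserver type.
type LivenessRecord struct{ Epoch int64 }

type nodeLiveness struct {
	nodes map[int32]LivenessRecord
}

// getLivenessLocked returns (record, ok) rather than (record, error): the only
// error the old signature could produce was the cache-miss sentinel, so a
// boolean communicates the same thing with less ceremony at every caller.
func (nl *nodeLiveness) getLivenessLocked(nodeID int32) (_ LivenessRecord, ok bool) {
	l, ok := nl.nodes[nodeID]
	return l, ok
}

// A caller that previously matched on the cache-miss error now reads:
//
//	if rec, ok := nl.getLivenessLocked(id); ok {
//		_ = rec // use the cached record
//	} else {
//		// cache miss: fall back to a KV read or report "record not found"
//	}
```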
Force-pushed from a87c86e to 785aea7.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
pkg/kv/kvserver/node_liveness.go, line 672 at r16 (raw file):
Previously, tbg (Tobias Grieger) wrote…
add "from KV"?
Done.
pkg/kv/kvserver/replica_range_lease.go, line 232 at r16 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Weird to reach out to the liveness-internal error here. But I guess it works.
Going to keep these as is; I think it's more likely to run into these errors due to gossip propagation delays than it is due to us somehow trying to pass in a node ID that does not actually exist.
pkg/kv/kvserver/replica_range_lease.go, line 556 at r16 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto
Ditto.
Build succeeded.
Now that we have #53842, we maintain the invariant that there always
exists a liveness record for any given node. We can now simplify our
handling of liveness records internally: where previously we had code to
handle the possibility of empty liveness records (we created a new one
on the fly), we can change them to assertions to verify that's no longer
possible.
When retrieving the liveness record from our in-memory cache,
it's possible for us to not find anything due to gossip delays. Instead
of simply giving up then, now we can read the records directly from KV
(and update our caches while in the area). This PR introduces this
mechanism through usage of `getLivenessRecordFromKV`.
Finally, as a bonus, this PR also surfaces a better error when trying to
decommission non-existent nodes. We're able to do this because now we
can always assume that a missing liveness record, as seen in the
decommission codepath, implies that the user is trying to decommission a
non-existent node.
We don't intend to backport this to 20.2 due to the hazard described in
#54216. We want this PR to bake on master and (possibly) trip up the
assertions added above if we've missed anything. They're the only ones
checking for the invariant we've introduced around liveness records.
That invariant will be depended on for long running migrations, so
better to shake things out early.
Release note: None