
kvserver: add assertions for invariants around liveness records #54544

Merged: 8 commits merged into cockroachdb:master on Sep 30, 2020

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented Sep 18, 2020

Now that we have #53842, we maintain the invariant that a liveness record
always exists for any given node. We can now simplify our internal handling
of liveness records: where previously we had code to handle the possibility
of empty liveness records (we created a new one on the fly), we can replace
it with assertions verifying that empty records are no longer possible.

When retrieving a liveness record from our in-memory cache, it's possible
not to find anything due to gossip delays. Instead of simply giving up, we
can now read the record directly from KV (and update our cache while we're
at it). This PR introduces that mechanism through getLivenessRecordFromKV.
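
A minimal sketch of that fallback, using the names that appear later in this thread (Self, getLivenessFromKV, the gossip node-ID lookup); the exact signatures are assumptions:

```go
// On a cache miss (gossip hasn't caught up yet), read the record directly
// from KV instead of giving up or proceeding with an empty liveness.
oldLiveness, ok := nl.Self()
if !ok {
	nodeID := nl.gossip.NodeID.Get()
	liveness, err := nl.getLivenessFromKV(ctx, nodeID)
	if err != nil {
		log.Infof(ctx, "unable to get liveness record from KV: %s", err)
		continue // retried by the surrounding retry loop
	}
	oldLiveness = liveness
}
```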

Finally, as a bonus, this PR also surfaces a better error when trying to
decommission non-existent nodes. We're able to do this because now we
can always assume that a missing liveness record, as seen in the
decommission codepath, implies that the user is trying to decommission a
non-existent node.


We don't intend to backport this to 20.2 due to the hazard described in
#54216. We want this PR to bake on master and (possibly) trip up the
assertions added above if we've missed anything. They're the only ones
checking for the invariant we've introduced around liveness records.
That invariant will be depended on for long running migrations, so
better to shake things out early.

Release note: None

@irfansharif irfansharif requested a review from a team as a code owner September 18, 2020 01:11
@cockroach-teamcity
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 200917.liveness-cleanup branch from 11da600 to e8f6137 Compare September 18, 2020 01:18
@irfansharif irfansharif requested a review from tbg September 18, 2020 01:18
@irfansharif irfansharif changed the title kvserver: clean up empty liveness handling code kvserver: add assertions for invariants around up liveness records Sep 18, 2020
@irfansharif irfansharif changed the title kvserver: add assertions for invariants around up liveness records kvserver: add assertions for invariants around liveness records Sep 18, 2020
Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/cli/error.go, line 364 at r1 (raw file):

		// Are we trying to recommission/decommision a node that does not exist?
		if strings.Contains(err.Error(), kvserver.ErrMissingLivenessRecord.Error()) {

not the right place for this check.
Say you're running a distsql query, and for some catastrophic reason a liveness record goes missing which causes an assertion error in SQL. We definitely want to see this in the output of the CLI command!

what you want is a check for this error specifically in the node (sub-)commands.

Member

@tbg tbg left a comment


With the current upgrade semantics (i.e. no long-running migrations), I think the old code is still necessary. If a node joins at 20.1 and is down throughout 20.2 but comes back at 21.1 (as unlikely as this is, it would require either weird recommissioning or manual version bump) then it may not have a liveness record. In practice I think this is exceedingly unlikely (outside of contrived tests) and when it occurs, we can work past it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/kv/kvserver/node_liveness.go, line 268 at r1 (raw file):

	for r := retry.StartWithCtx(ctx, base.DefaultRetryOptions()); r.Next(); {
		oldLivenessRec, err := nl.SelfEx()
		if err != nil && !errors.Is(err, errLivenessRecordCacheMiss) {

See other comment further below about this pattern.


pkg/kv/kvserver/node_liveness.go, line 633 at r1 (raw file):

					for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
						oldLiveness, err := nl.Self()
						if err != nil && !errors.Is(err, errLivenessRecordCacheMiss) {

If you get an error here, it seems like you're still going to heartbeatInternal with a zero liveness. That does not seem like what you want any more. Also, the error handling here seems clunky. Why not

```go
oldLiveness, err := nl.Self()
if errors.Is(err, errLivenessRecordCacheMiss) {
  oldLiveness, err = nl.getLivenessRecordFromKV(ctx, nodeID)
}
if err != nil {
  log.Infof(ctx, "unable to get liveness record: %s", err)
  continue
}

// ...
```

There's a similar pattern higher up in the review that should also get a second look.

@tbg tbg self-requested a review September 22, 2020 12:15
@irfansharif
Contributor Author

With the current upgrade semantics (i.e. no long-running migrations), I think the old code is still necessary. If a node joins at 20.1 and is down throughout 20.2 but comes back at 21.1 (as unlikely as this is, it would require either weird recommissioning or manual version bump) then it may not have a liveness record.

Hm, that sucks. I wonder if what we want here for iron-clad guarantees is an RPC-level check gating the rando node from joining the cluster (separately, it'd be a good idea to disallow RPCs from nodes with no liveness records in KV, since that should never happen going forward anyway). If we had that kind of check preventing this previously-zombie node from re-joining the cluster, would that be acceptable? I'm also happy to just live with our earlier code; I was only trying to sand down some sharp edges here, motivated by nothing in particular (which evidently surfaces more sharp edges of its own).

@tbg
Member

tbg commented Sep 23, 2020

If you haven't had a node come live at all in 20.2 and are expecting it to run in 21.1, I think it's fine if that expectation is occasionally not going to pan out (I'm pretty sure this is not something we even support doing). Let's go ahead with this PR.

@irfansharif irfansharif force-pushed the 200917.liveness-cleanup branch 6 times, most recently from 2e78dcc to dea687f Compare September 24, 2020 03:45
Contributor Author

@irfansharif irfansharif left a comment


PTAL, I also re-worked the NodeLiveness.getLivenessLocked method signature (and propagated it through the callers):

```diff
-func (nl *NodeLiveness) getLivenessLocked(nodeID roachpb.NodeID) (LivenessRecord, error)
+func (nl *NodeLiveness) getLivenessLocked(nodeID roachpb.NodeID) (_ LivenessRecord, ok bool)
```

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/error.go, line 364 at r1 (raw file):

Previously, knz (kena) wrote…

not the right place for this check.
Say you're running a distsql query, and for some catastrophic reason a liveness record goes missing which causes an assertion error in SQL. We definitely want to see this in the output of the CLI command!

what you want is a check for this error specifically in the node (sub-)commands.

Done.


pkg/kv/kvserver/node_liveness.go, line 268 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

See other comment further below about this pattern.

Done.


pkg/kv/kvserver/node_liveness.go, line 633 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

If you get an error here, it seems like you're still going to heartbeatInternal with a zero liveness. That does not seem like what you want any more. Also, the error handling here seems clunky. Why not

```go
oldLiveness, err := nl.Self()
if errors.Is(err, errLivenessRecordCacheMiss) {
  oldLiveness, err = nl.getLivenessRecordFromKV(ctx, nodeID)
}
if err != nil {
  log.Infof(ctx, "unable to get liveness record: %s", err)
  continue
}

// ...
```

There's a similar pattern higher up in the review that should also get a second look.

Done.

@irfansharif irfansharif force-pushed the 200917.liveness-cleanup branch from dea687f to d66a0f0 Compare September 24, 2020 15:24
@tbg tbg requested a review from knz September 24, 2020 15:45
Member

@tbg tbg left a comment


Ugh, the more I look at this code, the more questions I have about it (not your fault - the code was junk and you're merely alerting me to it).
However, since you're introducing a way to read from KV directly, I think we need to get that part right. Something still seems convoluted there, as you can see from my partially confused comments. It looks like the main method, GetLiveness(), does not fetch from KV; it just consults the cache. That is unexpected, and as a result there's a lot of confusion about when we can assert on a record being returned from methods. You would think that we would default to populating the cache from KV, and that not doing so amounts to "peeling back" the abstraction because you are an advanced internal user. I think that by exploring this approach we'll significantly simplify the liveness code, because the cache updating and db fetching will live in one code path that provides the foundation for the higher-level functions of the pkg (membership status, etc).

Btw, the signature changes look good, though they should be pulled out - this diff is already tricky to review because the code isn't in the best place.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)


pkg/cli/node.go, line 344 at r2 (raw file):

	if err := runDecommissionNodeImpl(ctx, c, nodeCtx.nodeDecommissionWait, nodeIDs); err != nil {
		// Are we trying to recommission/decommision a node that does not exist?
		if strings.Contains(err.Error(), kvserver.ErrMissingLivenessRecord.Error()) {

We need the same in recommission, right?

Also, wouldn't it be better if we used a gRPC error code here:

cockroach/pkg/server/server.go, lines 1966 to 1969 in 532870f:

```go
statusChanged, err := s.nodeLiveness.SetMembershipStatus(ctx, nodeID, targetStatus)
if err != nil {
	return err
}
```

if errors.Is(err, ErrMissingLivenessRecord), and then on the client tailor the message based on the error code?

That would seem more idiomatic: it avoids the string matching and the cli->kvserver dependency (which possibly goes across multiple versions).
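
A hedged sketch of that idea using gRPC status codes; the sentinel and the wrapForRPC helper here are stand-ins for illustration, not the PR's actual implementation:

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errMissingLivenessRecord stands in for kvserver.ErrMissingLivenessRecord.
var errMissingLivenessRecord = errors.New("missing liveness record")

// wrapForRPC converts the sentinel into a gRPC status so clients can branch
// on the code instead of matching error strings across versions.
func wrapForRPC(err error) error {
	if errors.Is(err, errMissingLivenessRecord) {
		return status.Errorf(codes.NotFound, "node does not exist: %v", err)
	}
	return err
}

func main() {
	err := wrapForRPC(fmt.Errorf("decommission: %w", errMissingLivenessRecord))
	// Client side: tailor the message based on the code alone.
	if status.Code(err) == codes.NotFound {
		fmt.Println("error: no node with that ID exists")
	}
}
```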


pkg/jobs/registry.go, line 790 at r2 (raw file):

	if !ok {
		if nodeLivenessLogLimiter.ShouldLog() {
			log.Warning(ctx, "unable to get node liveness")

"own liveness record not found"?


pkg/kv/kvserver/node_liveness.go, line 274 at r2 (raw file):

			livenessRec, err := nl.getLivenessRecordFromKV(ctx, nodeID)
			if err != nil {
				log.Infof(ctx, "unable to get liveness record: %s", err)

add "from KV"


pkg/kv/kvserver/node_liveness.go, line 396 at r2 (raw file):

	if oldLivenessRec.Liveness == (kvserverpb.Liveness{}) {
		log.Fatal(ctx, "invalid old liveness record; found to be empty")

Return an error here and everywhere else we assert on this stuff. I was thinking about the assertions a bit more and it seems that in the unlikely case in which they fire, they would not necessarily fire on the node that actually has the missing record (i.e. the one that missed a release). We definitely want to avoid that.
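
A sketch of the alternative being requested here, using errors.AssertionFailedf from cockroachdb/errors (referenced in a comment below); the surrounding function is assumed to return an error:

```go
if oldLivenessRec.Liveness == (kvserverpb.Liveness{}) {
	// Surface an assertion error to the caller rather than fatal-ing this
	// node, which may not be the one that actually missed the migration.
	return errors.AssertionFailedf("invalid old liveness record; found to be empty")
}
```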


pkg/kv/kvserver/node_liveness.go, line 516 at r2 (raw file):

) (statusChanged bool, err error) {
	if oldLivenessRec.Liveness == (kvserverpb.Liveness{}) {
		log.Fatal(ctx, "invalid old liveness record; found to be empty")

return err


pkg/kv/kvserver/node_liveness.go, line 575 at r2 (raw file):

	if !ok {
		// TODO(irfansharif): Is this conditional possible given we're always
		// persisting the liveness record?

It should not, right? GetLiveness hits the db when it needs to?

It looks to me as though you should be returning an assertion failed error here (errors.AssertionFailedf, just like for the other assertions)


pkg/kv/kvserver/node_liveness.go, line 954 at r2 (raw file):

func (nl *NodeLiveness) getLivenessLocked(nodeID roachpb.NodeID) (_ LivenessRecord, ok bool) {
	if l, ok := nl.mu.nodes[nodeID]; ok {
		return l, true

Oh, huh, I guess it doesn't get it from KV. That's because you want to avoid circular dependencies, right? Like checking the liveness to get lease on the liveness table leading to reading from the liveness table?
Something here is still wonky. You'd expect querying the cache directly to be the norm and that only at the range level (where you worry about the circularity) we consult the cache.

@tbg tbg self-requested a review September 24, 2020 15:45
@irfansharif
Contributor Author

The "look-aside cache" structure here was intentional, though I could've communicated that better. My thinking was that I wanted a more easily verifiable progression between where we are today (which is this confusing look-aside cache structure) and where we want to be (a look-through cache, probably in future refactoring).

Let me try to re-explain. The intended "diff" this PR introduces is probably best exemplified by the following snippet:

```diff
@@ -631,9 +625,15 @@ func (nl *NodeLiveness) StartHeartbeat(
 				func(ctx context.Context) error {
 					// Retry heartbeat in the event the conditional put fails.
 					for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
-						oldLiveness, err := nl.Self()
-						if err != nil && !errors.Is(err, ErrNoLivenessRecord) {
-							log.Errorf(ctx, "unexpected error getting liveness: %+v", err)
+						oldLiveness, ok := nl.Self()
+						if !ok {
+							nodeID := nl.gossip.NodeID.Get()
+							liveness, err := nl.getLivenessFromKV(ctx, nodeID)
+							if err != nil {
+								log.Infof(ctx, "unable to get liveness record: %s", err)
+								continue
+							}
+							oldLiveness = liveness
 						}
 						if err := nl.heartbeatInternal(ctx, oldLiveness, incrementEpoch); err != nil {
 							if errors.Is(err, ErrEpochIncremented) {
```

Previously we did this funky thing where we consulted our look-aside cache (nl.Self), and simply made use of the zero-value (kvserverpb.Liveness{}) in all subsequent code (within heartbeatInternal). With this PR, we do the same thing looking at our look-aside cache, but if we don't find anything, we'll now go fetch the value from KV. As for updating our cache to now store this record we just fetched, for now I'm letting it happen the same way it did before (within maybeUpdate, after we've CPut-ed our new liveness record). I was trying to avoid pulling the "cache update" codepath up above where we access it (thus making it a look-through cache) to keep this diff a bit more self-contained. Cache updates currently also trigger attached callbacks, and I'm not sure that we always want to call those callbacks when we're merely populating our cache for the first time, rather than "updating" it.


I'll try separating out the signature diff and making the error changes, plus add more clarifying commentary for what future progression for this NodeLiveness guy should look like (tl;dr use a look-through cache instead of our current look-aside one).
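
For concreteness, a rough sketch of the look-through shape being deferred here; the method names mirror ones discussed in this thread, while the maybeUpdate call and exact signatures are assumptions (and the callback caveat above still applies):

```go
func (nl *NodeLiveness) getLiveness(ctx context.Context, nodeID roachpb.NodeID) (LivenessRecord, error) {
	nl.mu.Lock()
	rec, ok := nl.mu.nodes[nodeID]
	nl.mu.Unlock()
	if ok {
		return rec, nil
	}
	// Cache miss: fetch from KV outside the mutex, then funnel the result
	// through the usual update path (which today also fires callbacks).
	rec, err := nl.getLivenessRecordFromKV(ctx, nodeID)
	if err != nil {
		return LivenessRecord{}, err
	}
	nl.maybeUpdate(rec)
	return rec, nil
}
```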

@irfansharif
Contributor Author

irfansharif commented Sep 24, 2020

Return an error here and everywhere else we assert on this stuff. I was thinking about the assertions a bit more and it seems that in the unlikely case in which they fire, they would not necessarily fire on the node that actually has the missing record (i.e. the one that missed a release). We definitely want to avoid that.

Hm, I don't think I quite understand the distinction you're drawing between simply Fatal-ing away vs. returning assertion failure errors. Looking at our existing code, we have these "blind retry on error" loops today, like below:

```go
if err != nil {
	if log.V(1) {
		log.Infof(ctx, "attempting to set liveness draining status to %v: %v", drain, err)
	}
	continue
}
```

These would just swallow any assertion failure errors we'd want to return. Given these are assertion failures, I also don't know that we should worry about where they're firing, just that they are firing. I think it also simplifies the code a bit to simply Fatal away instead of returning a possibly swallowed error.

@irfansharif irfansharif force-pushed the 200917.liveness-cleanup branch from d66a0f0 to 17b74d5 Compare September 24, 2020 23:10
@irfansharif
Contributor Author

I'll try separating out the signature diff and making the error changes, plus add more clarifying commentary for what future progression for this NodeLiveness guy should look like (tl;dr use a look-through cache instead of our current look-aside one).

Done.

You would think that we would default to populating the cache from KV and that not doing so amounts to "peeling back" the abstraction because you are an advanced internal user. I think by exploring this approach we'll significantly simplify the liveness code because the cache updating and db fetching will live in one code path that provides the foundation for the higher-level functions of the pkg (membership status, etc).

I completely agree. My primary intent with this PR was simply to add these assertions in, to try and prove to ourselves that liveness records do, in fact, always exist. I'm reasonably comfortable with that now that this goes through CI just fine, and the few test clusters I've spun up have yet to fall over. So for now I'm a bit inclined to live with this "peeled back" abstraction for a while. I think it's no worse than our previous state of things (and hopefully a bit better, having removed the empty-liveness handling code).

Member

@tbg tbg left a comment


Thank you for reworking the commit history! This was great to review (despite reviewable being a little confused at this point). I think we're on the home stretch here. The only comment worth highlighting is that we must not hold the mutex across the KV request.

Reviewed 14 of 14 files at r3, 3 of 3 files at r4, 2 of 2 files at r5, 4 of 4 files at r6, 1 of 1 files at r7, 2 of 2 files at r8, 11 of 11 files at r9.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)


pkg/kv/kvserver/node_liveness.go, line 278 at r4 (raw file):

	// best effort attempt at marking the node as draining. It can return
	// without an error after trying unsuccessfully a few times. Is that what we
	// want?

No, this seems to be another one of these cases where the retry loop bailing out is not correctly terminated. Please fix!

Something like

```go
if err := ctx.Err(); err != nil { return err }
return errors.New("failed")
```

should do the trick (the final return should be unreachable hopefully)
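
Spelled out as a full loop (a sketch only; trySetDraining is a hypothetical stand-in for the loop body, and the retry helpers mirror snippets elsewhere in this thread):

```go
for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
	if err := trySetDraining(ctx); err != nil {
		log.Infof(ctx, "attempting to set liveness draining status: %v", err)
		continue
	}
	return nil
}
// Only reached once retries are exhausted or the context is done; report
// failure instead of silently returning success.
if err := ctx.Err(); err != nil {
	return err
}
return errors.New("failed to drain node")
```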


pkg/kv/kvserver/node_liveness.go, line 1225 at r6 (raw file):

	var shouldReplace bool
	nl.mu.Lock()
	oldLivenessRec, err := nl.getLivenessLocked(newLivenessRec.NodeID)

I don't think we should be performing KV operations under nl.mu. Please make sure we never do that.

As a consequence, I think we should be going through shouldReplaceLiveness in every case (i.e. not set it to true here).


pkg/kv/kvserver/node_liveness.go, line 611 at r9 (raw file):

	liveness, ok := nl.GetLiveness(nodeID)
	if !ok {
		// TODO(irfansharif): Is this conditional possible given we're always

At the very least, you'd hit it for looking up nodes that never existed.

You're really saying that you don't expect any callers to put in a nodeID here that they haven't learned from some liveness record (which sort-of implies that they got the liveness from our cache, etc)?

And then maybe you're saying this should return the "not found" error instead of the cache miss?

Clarify the TODO, please.


pkg/kv/kvserver/node_liveness.go, line 976 at r9 (raw file):

// GetLiveness returns the liveness record for the specified nodeID. If the
// liveness record is not found (due to gossip propagation delays), we surface

(or because the node does not exist)


pkg/kv/kvserver/node_liveness.go, line 988 at r9 (raw file):

// getLivenessLocked returns the liveness record for the specified nodeID,
// consulting the in-memory cache. If nothing is found (could happen due to
// gossip propagation delays), we surface that to the caller.

(or because the node does not exist)


pkg/kv/kvserver/node_liveness.go, line 1184 at r9 (raw file):

		l, ok := nl.GetLiveness(update.newLiveness.NodeID)
		if !ok {
			// TODO(irfansharif): Is this conditional possible given we're

Maybe link this to the other TODO since you're going to write more prose there? Here it's even less likely that we'll hit this path, maybe if we tried to decom a nonexisting node? But we'd bail out higher up.


pkg/kv/kvserver/node_liveness_test.go, line 799 at r4 (raw file):

	}

	if err := mtc.nodeLivenesses[drainingNodeIdx].SetDraining(ctx, true, nil); err != nil {

Why'd you remove the inline comments? According to our style guide, they should remain.
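
For reference, the style-guide convention in question annotates opaque literal arguments inline, roughly like this (the /* drain */ and /* reporter */ names are assumptions about the parameter names):

```go
if err := mtc.nodeLivenesses[drainingNodeIdx].SetDraining(ctx, true /* drain */, nil /* reporter */); err != nil {
	t.Fatal(err)
}
```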

We'll need it in a future commit. Not returning an error masked an
existing bug (which we leave unfixed for now) where it was possible for
us to be unable to mark the target node as draining and still proceed.
The usage of SetDraining does not assume it to be a best-effort attempt.

Release note: None
In an earlier refactor we observed that it was possible to return
successfully from a node drain attempt despite not having fully drained
the target node. This made the `cockroach node drain` invocation a best
effort attempt, not a guaranteed one, which was not the intent. We
return the appropriate failure now.

Release note: None
In a future commit we'll introduce assertions that this is no longer
possible. Let's update our tests to be more representative of inputs
we'd normally expect.

Release note: None
Now that we have cockroachdb#53842, we maintain the invariant that there always
exists a liveness record for any given node. We can now simplify our
handling of liveness records internally: where previously we had code to
handle the possibility of empty liveness records (we created a new one
on the fly), we change them to assertions that verify that empty
liveness records are no longer flying around in the system.

When retrieving the liveness record from our in-memory cache,
it was possible for us to not find anything due to gossip delays.
Instead of simply giving up, we can now read the records directly
from KV (and eventually update our caches to store this newly read
record). This PR introduces this mechanism through usage of
`getLivenessRecordFromKV`. We should note that the existing cache
structure within NodeLiveness is a look-aside cache, and that's not
changed. It would further simplify things if it were a look-through
cache, where a miss during a fetch would populate the cache from KV,
but we defer that to future work. A TODO outlining this will be
introduced in a future commit.

A note for ease of review: one structural change introduced in this diff
is breaking down `ErrNoLivenessRecord` into `ErrMissingLivenessRecord` and
`errLivenessRecordCacheMiss`. The former will be used in a future commit
to generate better hints for users (it'll only ever surface when
attempting to decommission/recommission non-existent nodes). The latter
is used to represent cache misses. This too will be improved in a future
commit, where instead of returning a specific error on cache access,
we'll return a boolean instead.

---

We don't intend to backport this to 20.2 due to the hazard described in
cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the
assertions added above if we've missed anything. They're the only ones
checking for the invariant we've introduced around liveness records.
That invariant will be depended on for long running migrations, so
better to shake things out early.

Release note: None
This TODO hopefully adds some clarity to the current code structure and
suggests what it should look like going forward.

Release note: None
Contributor Author

@irfansharif irfansharif left a comment


The only comment worth highlighting is that we must not hold the mutex across the KV request.

Clarified below, but we aren't. Addressed the remaining comments, thanks for reviewing!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/kv/kvserver/node_liveness.go, line 278 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

No, this seems to be another one of these cases where the retry loop bailing out is not correctly terminated. Please fix!

Something like

```go
if err := ctx.Err(); err != nil { return err }
return errors.New("failed")
```

should do the trick (the final return should be unreachable hopefully)

Done.


pkg/kv/kvserver/node_liveness.go, line 1225 at r6 (raw file):

I don't think we should be performing KV operations under nl.mu

We're not, heh. This is getLivenessLocked, not getLivenessRecordFromKV. The diff here is functionally the same as before (checking the nodes map), except the empty check was pulled a level above (just to add the assertion I did in shouldReplaceLiveness).


pkg/kv/kvserver/node_liveness.go, line 611 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

At the very least, you'd hit it for looking up nodes that never existed.

You're really saying that you don't expect any callers to put in a nodeID here that they haven't learned from some liveness record (which sort-of implies that they got the liveness from our cache, etc)?

And then maybe you're saying this should return the "not found" error instead of the cache miss?

Clarify the TODO, please.

Done.


pkg/kv/kvserver/node_liveness.go, line 976 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

(or because the node does not exist)

Done.


pkg/kv/kvserver/node_liveness.go, line 988 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

(or because the node does not exist)

Done.


pkg/kv/kvserver/node_liveness.go, line 1184 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe link this to the other TODO since you're going to write more prose there? Here it's even less likely that we'll hit this path, maybe if we tried to decom a nonexisting node? But we'd bail out higher up.

Done. And yes, even less likely.


pkg/kv/kvserver/node_liveness_test.go, line 799 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why'd you remove the inline comments? According to our style guide, they should remain.

heh, was just trying to limit the line width (and Goland shows what these args are); reverted.

@irfansharif irfansharif force-pushed the 200917.liveness-cleanup branch from 17b74d5 to 5a1d2e2 Compare September 29, 2020 20:48
Specifically around {d,r}ecommissioning non-existent nodes.
We're able to do this now because we can always assume that a missing
liveness record, as seen in the decommission/recommission codepaths,
implies that the user is trying to decommission/recommission a
non-existent node.

Release note: None
@irfansharif irfansharif force-pushed the 200917.liveness-cleanup branch from 5a1d2e2 to a87c86e Compare September 29, 2020 20:59
Member

@tbg tbg left a comment


:lgtm:

Reviewed 1 of 14 files at r2, 2 of 4 files at r13, 2 of 2 files at r15, 11 of 11 files at r16.
Dismissed @knz from a discussion.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/kv/kvserver/node_liveness.go, line 1225 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I don't think we should be performing KV operations under nl.mu

We're not, heh. This is getLivenessLocked, not getLivenessRecordFromKV. The diff here is functionally the same as before (checking the nodes map), except the empty check was pulled a level above (just to add the assertion I did in shouldReplaceLiveness).

🤦 thanks


pkg/kv/kvserver/node_liveness.go, line 672 at r16 (raw file):

							liveness, err := nl.getLivenessFromKV(ctx, nodeID)
							if err != nil {
								log.Infof(ctx, "unable to get liveness record: %s", err)

add "from KV"?


pkg/kv/kvserver/replica_range_lease.go, line 232 at r16 (raw file):

				Existing:  status.Lease,
				Requested: reqLease,
				Message:   fmt.Sprintf("couldn't request lease for %+v: %v", nextLeaseHolder, errLivenessRecordCacheMiss),

Weird to reach out to the liveness-internal error here. But I guess it works.


pkg/kv/kvserver/replica_range_lease.go, line 556 at r16 (raw file):

					ctx = r.AnnotateCtx(ctx)
					log.Warningf(ctx, "can't determine lease status of %s due to node liveness error: %+v",
						lease.Replica, errLivenessRecordCacheMiss)

Ditto

Specifically, apply the following diff (and propagate through callers):

```diff
- getLivenessLocked(roachpb.NodeID) (LivenessRecord, error)
+ getLivenessLocked(roachpb.NodeID) (_ LivenessRecord, ok bool)
```

Previously there was only one possible error type returned, so just a
drive-by simplification.

Release note: None
@irfansharif irfansharif force-pushed the 200917.liveness-cleanup branch from a87c86e to 785aea7 Compare September 30, 2020 14:29
Contributor Author

@irfansharif irfansharif left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)


pkg/kv/kvserver/node_liveness.go, line 672 at r16 (raw file):

Previously, tbg (Tobias Grieger) wrote…

add "from KV"?

Done.


pkg/kv/kvserver/replica_range_lease.go, line 232 at r16 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Weird to reach out to the liveness-internal error here. But I guess it works.

Going to keep these as is; I think it's more likely to run into these errors due to gossip propagation delays than it is due to us somehow trying to pass in a node ID that does not actually exist.


pkg/kv/kvserver/replica_range_lease.go, line 556 at r16 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Ditto.

@craig
Contributor

craig bot commented Sep 30, 2020

Build succeeded:

@craig craig bot merged commit 1ac75dd into cockroachdb:master Sep 30, 2020
@irfansharif irfansharif deleted the 200917.liveness-cleanup branch September 30, 2020 15:50