server: always create a liveness record before starting up #53842
Conversation
Force-pushed from 07ed6be to 2f908bd
+cc @tbg, mind taking a look here as well? I'm still not sure if there's a simpler workaround for the --locality-advertise-addr issue; I feel like I must be missing something obvious. That said, writing to the store directly during bootstrap would work around that problem if nothing else is possible. See the comments in the branch around epoch={0,2}; I wanted your thoughts there.
We probably don't care about having the liveness
Force-pushed from 2f908bd to b976f92
This came out well!
Reviewed 8 of 8 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/kv/kvserver/node_liveness.go, line 461 at r1 (raw file):
// records after starting up, and incrementing to epoch=2 when doing so, at
// which point we'll set an appropriate expiration timestamp, gossip the
// liveness record, and update our in-memory representation of it.
Add/reword that an existing liveness record would not be overwritten but instead an error returned from this method.
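For context, a minimal, self-contained sketch of the semantics being asked for in this comment, with a plain Go map standing in for KV and an invented `errRecordExists`; the real `CreateLivenessRecord` presumably does a conditional put against the liveness key rather than a map lookup:

```go
package main

import (
	"errors"
	"fmt"
)

var errRecordExists = errors.New("liveness record already exists")

// liveness is a stripped-down stand-in for the real liveness proto.
type liveness struct {
	NodeID int32
	Epoch  int64
}

// store stands in for the KV liveness range.
var store = map[int32]liveness{}

// createLivenessRecord writes a brand-new liveness record for nodeID.
// An existing record is never overwritten; the caller gets an error
// back instead, which is the behavior the doc comment should spell out.
func createLivenessRecord(nodeID int32) error {
	if _, ok := store[nodeID]; ok {
		return errRecordExists
	}
	store[nodeID] = liveness{NodeID: nodeID, Epoch: 0}
	return nil
}

func main() {
	fmt.Println(createLivenessRecord(1)) // <nil>
	fmt.Println(createLivenessRecord(1)) // liveness record already exists
}
```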
pkg/kv/kvserver/node_liveness_test.go, line 174 at r1 (raw file):
defer tc.Stopper().Stop(ctx)
// Verify liveness records exist for all nodes.
So at this point, something in StartTestCluster has waited for all nodes to become live (otherwise this test is just flaky)? Add a comment.
pkg/server/node.go, line 383 at r1 (raw file):
// We're joining via gossip, so we don't have a liveness record for
// ourselves yet. Let's create one while here.
if err := n.storeCfg.NodeLiveness.CreateLivenessRecord(ctx, nodeID); err != nil {
Just reasoning this out to myself: we're not in the case in which "we are the cluster" because that only happens when this is the node that just got bootstrapped, but in that case it has a nodeID.
pkg/server/node.go, line 1160 at r1 (raw file):
// We create a liveness record here for the joining node while here. This
// way nodes are always guaranteed to have a liveness record present before
// fully starting up.
Add that this is required for long-running migrations correctness. (This might be a place where folks in the future might muck with this stuff and not know why it was the way it is now in the first place).
Force-pushed from e60d516 to 0187c16
TFTR! CI had failed on #54079, which is a bit strange (+ funny). I think something else is going on there, and I'm curious to know what. I was also missing an update to some logic test.
I'm a bit sad about not being able to use epoch=0, for no good reason. I think it's still possible to do so safely. I'll merge this when green, and might try lowering it again on master alone just to get back to the previous behavior.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/node_liveness.go, line 461 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add/reword that an existing liveness record would not be overwritten but instead an error returned from this method.
Done.
pkg/kv/kvserver/node_liveness_test.go, line 174 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
So at this point, something in StartTestCluster has waited for all nodes to become live (otherwise this test is just flaky)? Add a comment.
Done.
pkg/server/node.go, line 1160 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add that this is required for long-running migrations correctness. (This might be a place where folks in the future might muck with this stuff and not know why it was the way it is now in the first place).
Done.
Force-pushed from 0187c16 to ca1fd4e
bors r+
bors r-
Canceled.
Force-pushed from ca1fd4e to 6380a7a
bors r+
Build failed (retrying...):
bors r-
Canceled.
Force-pushed from 6380a7a to 8963e89
Hm, looks like I'll actually need to go with the epoch=0 initial write. Here's why: the existing node liveness tests use multiTestContext and all its quirks (I made use of TestCluster in the test I just added). One of the quirks is that it makes do without using Servers, which also means it makes do without the join RPC added in #52526. As a result, all nodes except for the bootstrap node will not actually have a liveness record persisted in KV.

Let's say we write our initial liveness record at epoch=1. For the bootstrap node, given the NodeLiveness component starts off knowing about this liveness record, we'll increment it to epoch=2 during our first heartbeat (cockroach/pkg/kv/kvserver/node_liveness.go, lines 761 to 763 at d0ead9a). For all other nodes, given we don't already have a liveness record persisted, after our first heartbeat we'll find ourselves at epoch=1 (cockroach/pkg/kv/kvserver/node_liveness.go, lines 756 to 758 at d0ead9a). The tests that want to assert epoch=1 for all liveness records, without having seen restarts, will not be too happy about this discrepancy (cockroach/pkg/kv/kvserver/node_liveness_test.go, lines 152 to 154 and lines 613 to 615 at d0ead9a).

This is another instance where having #8299 would have come in handy. I went ahead and kept the initial write at epoch=0 so we retain the previous epoch=1 pattern for the very first start. That way at least the node liveness tests that make use of multiTestContext pass. As for the changes introduced in this PR, TestNodeLivenessAppearsAtStart sufficiently exercises those code paths. I can try and follow up this month/next with migrating some of these tests over to use TestCluster (partially addressing #8299), but I'm inclined to not hold this PR up for it. I'll bors it once you take another look, @tbg.
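To make the epoch bookkeeping above concrete, here is a toy sketch (a hypothetical helper, not the actual heartbeat code) of the epoch a node lands on after its first heartbeat, depending on what was persisted at startup:

```go
package main

import "fmt"

// liveness is a stripped-down stand-in for the real liveness proto.
type liveness struct {
	Epoch int64
}

// firstHeartbeatEpoch models the epoch a node ends up at after its
// first successful heartbeat, given what (if anything) was already
// persisted for it. It mirrors the reasoning in the comment above.
func firstHeartbeatEpoch(existing *liveness) int64 {
	if existing == nil {
		// No record persisted yet (the old multiTestContext behavior):
		// the first heartbeat writes a fresh record at epoch=1.
		return 1
	}
	// A record already exists but hasn't been heartbeat yet, so the
	// node bumps the epoch before it can consider itself live.
	return existing.Epoch + 1
}

func main() {
	fmt.Println(firstHeartbeatEpoch(nil))                 // 1: no initial write
	fmt.Println(firstHeartbeatEpoch(&liveness{Epoch: 0})) // 1: initial write at epoch=0
	fmt.Println(firstHeartbeatEpoch(&liveness{Epoch: 1})) // 2: initial write at epoch=1
}
```

This is why the epoch=0 initial write preserves the existing "epoch=1 on very first start" pattern that the multiTestContext-based tests assert on.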
Force-pushed from 2443dd6 to 64c3b57
I suppose it's fine for me to manually write liveness records as part of multiTestContext.Start, which I just did. Doh. I still prefer starting off at epoch=0. I'm not too worried about the zero value of the liveness record because it isn't actually the zero value: the node ID is set. My motivation is primarily to retain the epoch=1 semantics that we have today. But I can change it again if asked.
I'm taking a look now, but already wanted to let you know that I'm fine writing at epoch zero. My initial inclination to avoid that was because you mentioned something tricky resulted from it (or so I remember). If that is not the case, perfect: epoch zero it is. Looking through the code now.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/node_liveness.go, line 797 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I don't think there are any migration concerns. By this phrasing I meant that this NodeLiveness guy hasn't yet discovered its liveness record (so the in-memory cache of liveness records needs updating).
In fact, if we did manually initialize here, it would actually do the same thing as we're doing explicitly. If we declared our oldLiveness to be empty, in trying to create a newLiveness our CPut attempt on the record below would fail. That would end up updating our in-memory cache with what was already found within KV, after which we'd have to retry anyway. I think writing it out in the way I did is a bit clearer (to me).
Separately, I think we actually want the error here to act as an assertion. Between us persisting a record during bootstrap, through the join rpc, and gossip code paths, there really should never be a missing liveness record once we get here. I'm inclined to let this bake on master for a bit as is to see what shakes out, and backporting to 20.2 after.
If a 20.1 node joins the cluster (at 20.1) and gets killed before persisting its liveness record, and then the cluster starts running 20.2 binaries, and the node comes back with the code of this PR, won't it eternally be unable to heartbeat itself? It's not going to get its liveness record created by the join rpc (it's not joining) nor bootstrap. I agree that this is sort of rare but if it's possible - we ought to handle it. What am I missing?
Good catch, hopefully the last time we (I) have to think about missing liveness records. Added it in #54216.
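For reference, a toy sketch of the CPut-and-retry shape discussed in this thread, with a Go map standing in for KV and invented helper names: a conditional put against a stale expected value fails, surfaces the actual record, and the caller refreshes its in-memory cache and retries.

```go
package main

import (
	"errors"
	"fmt"
)

var errConditionFailed = errors.New("condition failed")

// kv stands in for the liveness key; cache for NodeLiveness's
// in-memory copy of the record.
var (
	kv    = map[int32]int64{1: 3} // nodeID -> epoch actually persisted
	cache = map[int32]int64{}     // what this node thinks is persisted
)

// cput writes newEpoch only if the persisted epoch matches expected,
// returning the actual value on a mismatch (much like a CPut error).
func cput(nodeID int32, expected, newEpoch int64) (int64, error) {
	if actual := kv[nodeID]; actual != expected {
		return actual, errConditionFailed
	}
	kv[nodeID] = newEpoch
	return newEpoch, nil
}

func heartbeat(nodeID int32) {
	for {
		old := cache[nodeID] // possibly stale, or zero on a cache miss
		if actual, err := cput(nodeID, old, old+1); err != nil {
			// The CPut failed against what's really in KV: refresh the
			// cache with the actual record and retry, as described above.
			cache[nodeID] = actual
			continue
		}
		cache[nodeID] = old + 1
		return
	}
}

func main() {
	heartbeat(1)
	fmt.Println(kv[1], cache[1]) // 4 4
}
```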
In cockroachdb#53842 we introduced a change to always persist a liveness record on start up. As part of that change, we refactored how the liveness heartbeat codepath dealt with missing liveness records: it knew to fetch it from KV given we were now maintaining the invariant that it would always be present. Except that wasn't necessarily true, as demonstrated by the following scenario:
```
// - v20.1 node gets added to v20.1 cluster, and is quickly removed
// before being able to persist its liveness record.
// - The cluster is upgraded to v20.2.
// - The node from earlier is rolled into v20.2, and re-added to the
// cluster.
// - It's never able to successfully heartbeat (it didn't join
// through the join rpc, bootstrap, or gossip). Welp.
```
Though admittedly unlikely, we should handle it all the same instead of simply erroring out. We'll just fall back to creating the liveness record in-place as we did in v20.1 code. We can remove this fallback in 21.1 code.

Release note: None
54216: kvserver: address migration concern with node liveness r=irfansharif a=irfansharif

In #53842 we introduced a change to always persist a liveness record on start up. As part of that change, we refactored how the liveness heartbeat codepath dealt with missing liveness records: it knew to fetch it from KV given we were now maintaining the invariant that it would always be present. Except that wasn't necessarily true, as demonstrated by the following scenario:
```
// - v20.1 node gets added to v20.1 cluster, and is quickly removed
// before being able to persist its liveness record.
// - The cluster is upgraded to v20.2.
// - The node from earlier is rolled into v20.2, and re-added to the
// cluster.
// - It's never able to successfully heartbeat (it didn't join
// through the join rpc, bootstrap, or gossip). Welp.
```
Though admittedly unlikely, we should handle it all the same instead of simply erroring out. We'll just fall back to creating the liveness record in-place as we did in v20.1 code. We can remove this fallback in 21.1 code.

---

First commit is from #54224.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
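As a rough illustration of the fallback described in that change (invented names, a map in place of KV): a missing record during heartbeat is handled by creating the record in place, v20.1-style, rather than erroring out.

```go
package main

import (
	"errors"
	"fmt"
)

type liveness struct {
	NodeID int32
	Epoch  int64
}

// kv stands in for the liveness range.
var kv = map[int32]*liveness{}

var errMissingRecord = errors.New("missing liveness record")

// getLivenessFromKV models reading the record straight from KV.
func getLivenessFromKV(nodeID int32) (*liveness, error) {
	l, ok := kv[nodeID]
	if !ok {
		return nil, errMissingRecord
	}
	return l, nil
}

// heartbeat shows the 20.2-era fallback: if the invariant is violated
// (no record at all, e.g. the stray v20.1 node from the scenario
// above), create the record in place rather than erroring out.
func heartbeat(nodeID int32) *liveness {
	l, err := getLivenessFromKV(nodeID)
	if errors.Is(err, errMissingRecord) {
		// Fallback, to be removed in 21.1: behave like v20.1 and
		// create the record on the fly.
		l = &liveness{NodeID: nodeID, Epoch: 0}
		kv[nodeID] = l
	}
	l.Epoch++
	return l
}

func main() {
	fmt.Println(heartbeat(7)) // &{7 1}: record created in place, then bumped
}
```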
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we can change them to assertions to verify that's no longer possible.

When retrieving the liveness record from our in-memory cache, it's possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and update our caches while in the area). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`.

Finally, as a bonus, this PR also surfaces a better error when trying to decommission non-existent nodes. We're able to do this because now we can always assume that a missing liveness record, as seen in the decommission codepath, implies that the user is trying to decommission a non-existent node.

---

We don't intend to backport this to 20.2 due to the hazard described in cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early.

Release note: None
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we change them to assertions that verify that empty liveness records are no longer flying around in the system.

When retrieving the liveness record from our in-memory cache, it was possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and eventually update our caches to store this newly read record). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. We should note that the existing cache structure within NodeLiveness is a look-aside cache, and that's not changed. It would further simplify things if it was a look-through cache where the update happened while fetching any record and failing to find it, but we defer that to future work. A TODO outlining this will be introduced in a future commit.

A note for ease of review: one structural change introduced in this diff is breaking down `ErrNoLivenessRecord` into `ErrMissingLivenessRecord` and `errLivenessRecordCacheMiss`. The former will be used in a future commit to generate better hints for users (it'll only ever surface when attempting to decommission/recommission non-existent nodes). The latter is used to represent cache misses. This too will be improved in a future commit, where instead of returning a specific error on cache access, we'll return a boolean instead.

---

We don't intend to backport this to 20.2 due to the hazard described in cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early.

Release note: None
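A small sketch of the look-aside arrangement described above, with simplified and partly invented signatures: a cache miss falls back to a direct KV read (the role `getLivenessRecordFromKV` plays), and the caller, not the cache, does the refresh; the look-through variant is the deferred future work.

```go
package main

import (
	"errors"
	"fmt"
)

type liveness struct {
	NodeID int32
	Epoch  int64
}

var (
	kv    = map[int32]liveness{2: {NodeID: 2, Epoch: 5}} // authoritative copy
	cache = map[int32]liveness{}                         // gossip-fed, may lag
)

var errLivenessRecordCacheMiss = errors.New("liveness record not found in cache")

// getLivenessLocked consults only the look-aside cache.
func getLivenessLocked(nodeID int32) (liveness, error) {
	if l, ok := cache[nodeID]; ok {
		return l, nil
	}
	return liveness{}, errLivenessRecordCacheMiss
}

// getLivenessRecordFromKV reads the record directly from KV, for when
// gossip hasn't caught up yet.
func getLivenessRecordFromKV(nodeID int32) (liveness, error) {
	l, ok := kv[nodeID]
	if !ok {
		// With the invariant from this PR, this shouldn't happen for a
		// real node ID.
		return liveness{}, errors.New("missing liveness record")
	}
	return l, nil
}

func getLiveness(nodeID int32) (liveness, error) {
	l, err := getLivenessLocked(nodeID)
	if errors.Is(err, errLivenessRecordCacheMiss) {
		if l, err = getLivenessRecordFromKV(nodeID); err == nil {
			cache[nodeID] = l // the caller updates the look-aside cache
		}
	}
	return l, err
}

func main() {
	fmt.Println(getLiveness(2)) // {2 5} <nil>, via KV on the first call
	fmt.Println(getLiveness(2)) // {2 5} <nil>, via the cache afterwards
}
```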
54544: kvserver: add assertions for invariants around liveness records r=irfansharif a=irfansharif

Now that we have #53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we can change them to assertions to verify that's no longer possible.

When retrieving the liveness record from our in-memory cache, it's possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and update our caches while in the area). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`.

Finally, as a bonus, this PR also surfaces a better error when trying to decommission non-existent nodes. We're able to do this because now we can always assume that a missing liveness record, as seen in the decommission codepath, implies that the user is trying to decommission a non-existent node.

---

We don't intend to backport this to 20.2 due to the hazard described in #54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early.

Release note: None

54812: docker: Base the docker image on RedHat UBI r=bdarnell,DuskEagle a=jlinder

Before: The docker image was based on Debian 9.12 slim.

Why: This change will help on-prem customers from a security and compliance perspective. It also aligns with our publishing images into the RedHat Marketplace.

Now: Published docker images are based on the RedHat UBI 8 base image.

Fixes: #49643

Release note (backward-incompatible change): CockroachDB Docker images are now based on the RedHat ubi8/ubi base image instead of Debian 9.12 slim. This will help on-prem customers from a security and compliance perspective.

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: James H. Linder <[email protected]>
Since cockroachdb#53842 we write an initial liveness record with a zero timestamp. This sometimes shows up as the following flake.
```
logic.go:2283:
testdata/logic_test/crdb_internal:356: SELECT node_id, regexp_replace(epoch::string, '^\d+$', '<epoch>') as epoch, regexp_replace(expiration, '^\d+\.\d+,\d+$', '<timestamp>') as expiration, draining, decommissioning, membership FROM crdb_internal.gossip_liveness WHERE node_id = 1
expected:
node_id  epoch    expiration   draining  decommissioning  membership
1        <epoch>  <timestamp>  false     false            active

but found (query options: "colnames") :
node_id  epoch    expiration  draining  decommissioning  membership
1        <epoch>  0,0         false     false            active
```
Release note: None
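The flake is easy to see from the rewrite pattern the logic test applies to the expiration column: a zero timestamp simply doesn't match it. A quick standalone check (not taken from the test harness):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The logic test rewrites expirations with this pattern before
	// comparing output.
	expiration := regexp.MustCompile(`^\d+\.\d+,\d+$`)

	// A heartbeat-written expiration matches and becomes <timestamp>...
	fmt.Println(expiration.MatchString("1600889167.123456789,0")) // true

	// ...but the zero timestamp written by the initial liveness record
	// does not, so it leaks through as "0,0" and the test flakes.
	fmt.Println(expiration.MatchString("0,0")) // false
}
```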
Previously it used to be the case that it was possible for a node to be
up and running, and for there to be no corresponding liveness record for
it. This was a very transient situation as liveness records are created
for a given node as soon as it sends out its first heartbeat. Still, given
that this could take a few seconds, it lent to a lot of complexity in
our handling of node liveness where we had to always anticipate the
possibility of there being no corresponding liveness record for a given
node (and thus creating it if necessary).
Having a liveness record for each node always present is a crucial
building block for long running migrations (#48843). There the intention
is to have the orchestrator process look towards the list of liveness
records for an authoritative view of cluster membership. Previously when
it was possible for an active member of the cluster to not have a
corresponding liveness record (no matter how unlikely or short-lived in
practice), we could not generate such a view.
This is an alternative implementation for #53805. Here we choose to
manually write the liveness record for the bootstrapping node when
writing initial cluster data. For all other nodes, we do it on the
server-side of the join RPC. We're also careful to do it in the legacy
codepath when joining a cluster through gossip.
Release note: None
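Pulling the description together, a schematic sketch (all names invented, a map in place of KV) of the invariant being established: every path into the cluster, whether bootstrap, the join RPC, or the legacy gossip codepath, ends with a liveness record persisted before the node finishes starting up.

```go
package main

import "fmt"

// joinPath enumerates the ways a node can come up, following the
// description above; the names here are illustrative only.
type joinPath int

const (
	bootstrap joinPath = iota // first node, writes initial cluster data
	joinRPC                   // nodes joining via the join RPC
	gossip                    // legacy codepath, pre-join-RPC clusters
)

// livenessRecords stands in for the KV liveness range.
var livenessRecords = map[int32]int64{} // nodeID -> epoch

func writeInitialLiveness(nodeID int32) {
	if _, ok := livenessRecords[nodeID]; !ok {
		livenessRecords[nodeID] = 0 // initial record, epoch=0
	}
}

// startNode sketches the invariant: no matter which path a node takes
// into the cluster, its liveness record exists before start-up completes.
func startNode(nodeID int32, path joinPath) {
	switch path {
	case bootstrap:
		// Written alongside the rest of the initial cluster data.
		writeInitialLiveness(nodeID)
	case joinRPC:
		// Written on the server side of the join RPC, on behalf of the
		// joining node.
		writeInitialLiveness(nodeID)
	case gossip:
		// Legacy path: the joining node writes its own record once it
		// has a node ID.
		writeInitialLiveness(nodeID)
	}
	fmt.Printf("n%d started, liveness epoch=%d\n", nodeID, livenessRecords[nodeID])
}

func main() {
	startNode(1, bootstrap)
	startNode(2, joinRPC)
	startNode(3, gossip)
}
```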