server: always create a liveness record before starting up #53805
Conversation
Deferred doing this in cockroachdb#52526. Probably a good idea to have it; it'll bring down the cluster convergence time (the time taken for all nodes to find out about the initialization) by a bit.
Release justification: low risk, high benefit changes to existing functionality
Release note: None
@tbg, I'm a bit confused. I'm not sure why this wasn't possible before. The only thing I can think of to make us not want to do this is the following scenario:
While shitty, this is also a bug that already exists today. I've actually seen it happen in https://github.com/cockroachlabs/support/issues/584, where a node was quickly added and removed before it was able to persist a liveness record for itself. That said, is there another reason why we shouldn't do what we're doing in this PR?
Previously it used to be the case that it was possible for a node to be up and running, and for there to be no corresponding liveness record for it. This was a very transient situation, as liveness records are created for a given node as soon as it sends out its first heartbeat. Still, given that this could take a few seconds, it lent itself to a lot of complexity in our handling of node liveness, where we had to always anticipate the possibility of there being no corresponding liveness record for a given node (and thus creating it if necessary).

It was originally thought we'd only be able to guarantee persisting a liveness record for a joining node if we had infrastructure like the join RPC, introduced in cockroachdb#52526, but staring at it a bit harder we can simply ensure we heartbeat our liveness record at least once before fully spinning up our server (aka opening the RPC floodgates).

Having a liveness record for each node always present is a crucial building block for long running migrations (cockroachdb#48843). There the intention is to have the orchestrator process look towards the list of liveness records for an authoritative view of cluster membership. Previously, when it was possible for an active member of the cluster to not have a corresponding liveness record (no matter how unlikely or short-lived in practice), we could not generate such a view.

Release justification: low risk, high benefit changes to existing functionality
Release note: None
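To make the ordering concrete, here is a minimal, self-contained Go sketch of the idea in the commit message above: block startup until one liveness heartbeat has succeeded, and only then open the RPC floodgates. All names here (`nodeLiveness`, `heartbeatOnce`, `startServer`) are illustrative stand-ins, not the actual cockroach code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// nodeLiveness is a stand-in for the real node-liveness subsystem.
type nodeLiveness struct{}

// heartbeatOnce persists/refreshes this node's liveness record a single time.
// In the real server this would write the liveness record to KV.
func (nl *nodeLiveness) heartbeatOnce(ctx context.Context) error {
	return nil
}

// startServer sketches the ordering: create the liveness record first, then
// open the "RPC floodgates" (start accepting external traffic).
func startServer(ctx context.Context) error {
	nl := &nodeLiveness{}

	// Retry until the first heartbeat lands or the caller gives up; a node
	// should not advertise itself as operational without a liveness record.
	for {
		if err := nl.heartbeatOnce(ctx); err == nil {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}

	fmt.Println("liveness record created; opening RPC floodgates")
	// ... start serving RPCs here ...
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := startServer(ctx); err != nil {
		fmt.Println("startup failed:", err)
	}
}
```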
Force-pushed from c8d2132 to ecc0abf.
I'm not sure why I'm seeing gossip/locality-address fail with my changes. I suspect it's due to #42741, but I could be wrong. Edit: hm, it looks like the node is unable to send itself batch requests (needed to heartbeat) given it's not yet operational. For the following setup:
This works:
But this doesn't:
Is there some grpc magic "don't send an external rpc if headed for 127.0.0.1" that's coming into play here? TIL. I see that the following also works:
// Begin the node liveness heartbeat. Add a callback which records the
// local store "last up" timestamp for every store whenever the liveness
// record is updated. We're sure to do this before we open RPC
// floodgates.
Not guaranteed since it's async, so add that it's best effort
// local store "last up" timestamp for every store whenever the liveness
// record is updated. We're sure to do this before we open RPC
// floodgates.
var livenessOnce sync.Once
nit: don't leak livenessOnce into the main scope; do
livenessRecordCreated := make(chan struct{}, 1)
{
// all the other stuff
}
if initialStart { ... }
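For illustration, here is a runnable expansion of the sketch above, under the assumption that the heartbeat callback may fire many times and only the first signal matters; the names (`onHeartbeat`, `initialStart`) are hypothetical, not the code in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	initialStart := true

	livenessRecordCreated := make(chan struct{}, 1)
	{
		// All the other liveness setup lives in this block, so nothing but
		// the channel escapes into the outer scope. The callback may fire
		// many times; only the first signal matters.
		var once sync.Once
		onHeartbeat := func() {
			once.Do(func() { livenessRecordCreated <- struct{}{} })
		}
		// Simulate the async heartbeat loop invoking the callback repeatedly.
		go func() {
			onHeartbeat()
			onHeartbeat()
		}()
	}

	if initialStart {
		// Block until the first liveness record has been written.
		<-livenessRecordCreated
		fmt.Println("liveness record created")
	}
}
```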
This PR is good and there's no good reason we haven't done this yet. But as you've found out, it also doesn't give us the invariant that we need, which is that the liveness record exists before the node can ever consider itself part of the cluster. I don't really understand the problem you're seeing regarding the "sending itself heartbeats". Until we're in modeOperational, the node cannot receive replicas, so it shouldn't ever be consulted for anything. Gossip should be active by this point, so the KV client should be able to route to the correct nodes. Can you post some more details?
Yes there is! See Line 481 in a3284b7
But if we rely on that for things working, there's a problem. Ah, I understand now - the issue is in the case in which this node is the one bootstrapping the local cluster. In that case, in the new code, we're trying to synchronously heartbeat the liveness record, but the gRPC server does not yet allow RPCs. Ok, makes sense. Add a comment to that effect and simply move the liveness record stuff past the
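As a generic illustration of the "local short-circuit" being referenced (not the actual `rpc.Context` code at that line), the pattern looks roughly like this: if the dial target matches the node's own advertised address, hand back an in-process handle to the server implementation rather than a real network connection, which is why self-addressed requests can work without going over the wire. All types below (`batchServer`, `nodeDialer`) are made up for the sketch.

```go
package main

import "fmt"

// batchServer is a stand-in for the internal RPC service implementation.
type batchServer interface {
	Batch(req string) (string, error)
}

type localServer struct{}

func (localServer) Batch(req string) (string, error) {
	return "handled locally: " + req, nil
}

type nodeDialer struct {
	ownAddr string
	local   batchServer
}

// dial returns an in-process handle for self-addressed targets; anything else
// would go through a real gRPC connection (elided in this sketch).
func (d *nodeDialer) dial(addr string) (batchServer, error) {
	if addr == d.ownAddr {
		return d.local, nil
	}
	return nil, fmt.Errorf("network dialing to %s elided in this sketch", addr)
}

func main() {
	d := &nodeDialer{ownAddr: "127.0.0.1:26257", local: localServer{}}
	c, err := d.dial("127.0.0.1:26257")
	if err != nil {
		panic(err)
	}
	resp, _ := c.Batch("heartbeat")
	fmt.Println(resp)
}
```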
Previously it used to be the case that it was possible for a node to be up and running, and for there to be no corresponding liveness record for it. This was a very transient situation, as liveness records are created for a given node as soon as it sends out its first heartbeat. Still, given that this could take a few seconds, it lent itself to a lot of complexity in our handling of node liveness, where we had to always anticipate the possibility of there being no corresponding liveness record for a given node (and thus creating it if necessary).

Having a liveness record for each node always present is a crucial building block for long running migrations (cockroachdb#48843). There the intention is to have the orchestrator process look towards the list of liveness records for an authoritative view of cluster membership. Previously, when it was possible for an active member of the cluster to not have a corresponding liveness record (no matter how unlikely or short-lived in practice), we could not generate such a view.

---

This is an alternative implementation for cockroachdb#53805. Here we choose to manually write the liveness record for the bootstrapping node when writing initial cluster data. For all other nodes, we do it on the server side of the join RPC. We're also careful to do it in the legacy codepath when joining a cluster through gossip.

Release note: None
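A hedged sketch of that alternative flow, with made-up types (`kvStore`, `livenessRecord`, `handleJoin`) standing in for the real bootstrap and join-RPC code paths: the bootstrapping node writes its own record as part of the initial cluster data, and every other node gets one written by the server handling its join RPC.

```go
package main

import (
	"fmt"
	"sync"
)

// livenessRecord is a minimal stand-in for the real liveness proto.
type livenessRecord struct {
	NodeID int
	Epoch  int
}

// kvStore fakes the KV writes that would back liveness records.
type kvStore struct {
	mu      sync.Mutex
	records map[int]livenessRecord
}

func (kv *kvStore) putLiveness(rec livenessRecord) {
	kv.mu.Lock()
	defer kv.mu.Unlock()
	kv.records[rec.NodeID] = rec
}

// bootstrapCluster writes initial cluster data, including the first node's
// liveness record, so the record exists before that node starts serving.
func bootstrapCluster(kv *kvStore) {
	kv.putLiveness(livenessRecord{NodeID: 1, Epoch: 1})
}

// handleJoin sketches the server side of a join RPC: allocate a node ID and
// persist a liveness record for the joiner before acknowledging the join.
func handleJoin(kv *kvStore, nextNodeID int) int {
	kv.putLiveness(livenessRecord{NodeID: nextNodeID, Epoch: 1})
	return nextNodeID
}

func main() {
	kv := &kvStore{records: map[int]livenessRecord{}}
	bootstrapCluster(kv)
	newID := handleJoin(kv, 2)
	fmt.Printf("liveness records after join of n%d: %v\n", newID, kv.records)
}
```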
Abandoning this PR in favor of #53842.
53842: server: always create a liveness record before starting up r=irfansharif a=irfansharif

Previously it used to be the case that it was possible for a node to be up and running, and for there to be no corresponding liveness record for it. This was a very transient situation, as liveness records are created for a given node as soon as it sends out its first heartbeat. Still, given that this could take a few seconds, it lent itself to a lot of complexity in our handling of node liveness, where we had to always anticipate the possibility of there being no corresponding liveness record for a given node (and thus creating it if necessary).

Having a liveness record for each node always present is a crucial building block for long running migrations (#48843). There the intention is to have the orchestrator process look towards the list of liveness records for an authoritative view of cluster membership. Previously, when it was possible for an active member of the cluster to not have a corresponding liveness record (no matter how unlikely or short-lived in practice), we could not generate such a view.

---

This is an alternative implementation for #53805. Here we choose to manually write the liveness record for the bootstrapping node when writing initial cluster data. For all other nodes, we do it on the server side of the join RPC. We're also careful to do it in the legacy codepath when joining a cluster through gossip.

Release note: None

53994: cli: allow SQL commands to use password authn in more cases r=bdarnell,irfansharif,aaron-crl a=knz

First two commits from #53991.

Previously, SQL password authn was only allowed over TLS connections. With this change, password authn is allowed regardless of whether the connection uses TLS. This is implemented by only asking for a password interactively the first time the server complains that password auth has failed. This way, no password is ever requested interactively if the server "trusts" the connection (via HBA rules or `--insecure`).

Release justification: low risk, high benefit changes to existing functionality

54035: sql: emit more tracing events from the stats cache r=RaduBerinde a=RaduBerinde

The stats cache has various "slow" paths (where we need to query the system table). These are currently only logged if verbosity is high. This change switches to `VEvent` in most cases, so that these are visible during tracing (including in statement diagnostics bundles). This will allow us to diagnose slow planning times, e.g. due to the stats cache getting full.

Release justification: low-risk change to existing functionality, high potential benefit for debugging issues.
Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
… not just availability

Fixes cockroachdb#53515. We should have the autoupgrade process look at the fully decommissioned bit we added in cockroachdb#50329, instead of just looking at availability. It would avoid the hazard described in cockroachdb#53515. Previously the autoupgrade process was also looking at NodeStatusLiveness, which we've since soured upon (see cockroachdb#50478).

Now that we always create a liveness record on start up (cockroachdb#53805), we can simply fetch all liveness records from KV. We add a helper to do this, which we'll also rely on in future PRs for other purposes. It's a bit unfortunate that we're further adding on to the NodeLiveness API without changing the caching structure, but fetching records from KV is the world we're hoping to move towards going forward.

Release note: None
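To illustrate the membership check described above (the field and constant names are assumptions, not the actual proto), the autoupgrade loop would skip only nodes that are fully decommissioned, rather than reasoning about current availability:

```go
package main

import "fmt"

type membership string

const (
	active          membership = "active"
	decommissioning membership = "decommissioning"
	decommissioned  membership = "decommissioned"
)

// liveness is a simplified stand-in for a node liveness record.
type liveness struct {
	NodeID     int
	Membership membership
}

// nodesBlockingUpgrade returns the nodes the upgrade must still wait on:
// everything that is not fully decommissioned, regardless of availability.
func nodesBlockingUpgrade(records []liveness) []int {
	var ids []int
	for _, l := range records {
		if l.Membership != decommissioned {
			ids = append(ids, l.NodeID)
		}
	}
	return ids
}

func main() {
	records := []liveness{
		{NodeID: 1, Membership: active},
		{NodeID: 2, Membership: decommissioning},
		{NodeID: 3, Membership: decommissioned},
	}
	fmt.Println("still relevant for upgrade:", nodesBlockingUpgrade(records))
}
```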
Now that we always create a liveness record on start up (cockroachdb#53805), we can simply fetch all liveness records from KV when wanting an up-to-date view of all nodes in the cluster. We add a helper to do as much, which we'll rely on in future commits. It's a bit unfortunate that we're further adding on to the NodeLiveness API without changing the underlying look-aside caching structure, but fetching records from KV is the world we're hoping to start moving towards over time. Release note: None
To flesh out the migrations infrastructure (in future commits), we'll need a handle on all the nodes present in the system. Now that we always create a liveness record on start up (cockroachdb#53805), we can simply fetch all liveness records from KV. We add a helper to do so. It's a bit unfortunate that we're further adding on to the NodeLiveness API without changing the caching structure, but fetching records from KV is the world we're hoping to move towards going forward. This does mean that we're introducing a direct dependency on NodeLiveness in the sql layer, and there are improvements to be made here around interfaces delineating between "system tenant only" sql code and everything else. Only system tenants have the privilege to set cluster settings (or at least the version setting specifically), which is what this API will look to power. Release note: None
Now that we always create a liveness record on start up (cockroachdb#53805), we can simply fetch all records from KV when wanting an up-to-date view of all nodes that have ever been a part of the cluster. We add a helper to do as much, which we'll rely on when introducing long running migrations (cockroachdb#56107). It's a bit unfortunate that we're further adding on to the liveness API without changing the underlying look-aside cache structure, but fetching up-to-date records directly from KV is the world we're hoping to start moving towards over time. The TODO added in [1] outlines what the future holds. [1]: cockroachdb@d631239 Release note: None
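A rough sketch of what such a helper could look like under assumed types: the real GetLivenessesFromKV scans the node-liveness key range in KV and decodes each record, which is approximated here with a fake scanner so the example stays self-contained.

```go
package main

import (
	"context"
	"fmt"
)

// livenessRecord is a minimal stand-in for the real liveness proto.
type livenessRecord struct {
	NodeID int
	Epoch  int
}

// kvScanner abstracts "scan the node-liveness key range"; the real code would
// issue a KV scan over that span and unmarshal each row.
type kvScanner interface {
	ScanLiveness(ctx context.Context) ([]livenessRecord, error)
}

// fakeKV is an in-memory substitute used only for this sketch.
type fakeKV struct{ rows []livenessRecord }

func (f fakeKV) ScanLiveness(ctx context.Context) ([]livenessRecord, error) {
	return f.rows, nil
}

// getLivenessesFromKV returns the authoritative, up-to-date set of records
// straight from KV, bypassing the look-aside cache mentioned above.
func getLivenessesFromKV(ctx context.Context, kv kvScanner) ([]livenessRecord, error) {
	return kv.ScanLiveness(ctx)
}

func main() {
	kv := fakeKV{rows: []livenessRecord{{NodeID: 1, Epoch: 3}, {NodeID: 2, Epoch: 1}}}
	recs, err := getLivenessesFromKV(context.Background(), kv)
	if err != nil {
		panic(err)
	}
	fmt.Println(recs)
}
```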
56130: opt: updated normalization rules for folding IS expressions with a NULL predicate r=mgartner a=jayshrivastava

Previously, for statements such as `SELECT (foo,bar) IS DISTINCT FROM NULL FROM a_table`, the expression `(foo,bar) IS DISTINCT FROM NULL` would not be normalized to `true`. Similarly, if `IS NOT DISTINCT FROM NULL` were used, then the expression would not be normalized to `false`. The previous statement would only normalize if the tuple/array in the statement contained only constants. Given the updates in this commit, normalization will be applied when any arrays or tuples are provided in this situation.

Release note: None

56217: sql: clean up uses of Statement r=RaduBerinde a=RaduBerinde

This change cleans up the use of `sql.Statement` and reduces some allocations. Specifically:
- we create a `Statement` lower in the stack (in `execStmtInOpenState`), and pass only what we need in the higher layers;
- we change various functions to take a `tree.Statement` rather than an entire `Statement` when possible;
- we move down the `withStatement` context allocation, so that it is avoided in the implicit transaction state transition;
- we store a copy rather than a pointer to the Statement in the planner;
- we avoid directly using `stmt` fields from `func()` declarations that escape;
- we populate `Statement.AnonymizedStr` upfront. The anonymized string is always needed (to update statement stats).

```
name                                old time/op    new time/op    delta
EndToEnd/kv-read/EndToEnd            153µs ± 1%     154µs ± 2%      ~     (p=0.486 n=4+4)
EndToEnd/kv-read-no-prep/EndToEnd    216µs ± 1%     217µs ± 1%      ~     (p=0.886 n=4+4)
EndToEnd/kv-read-const/EndToEnd      111µs ± 1%     113µs ± 1%    +1.01%  (p=0.029 n=4+4)

name                                old alloc/op   new alloc/op   delta
EndToEnd/kv-read/EndToEnd           25.8kB ± 1%    25.5kB ± 1%      ~     (p=0.114 n=4+4)
EndToEnd/kv-read-no-prep/EndToEnd   32.2kB ± 1%    31.9kB ± 1%      ~     (p=0.686 n=4+4)
EndToEnd/kv-read-const/EndToEnd     21.0kB ± 1%    20.7kB ± 2%      ~     (p=0.200 n=4+4)

name                                old allocs/op  new allocs/op  delta
EndToEnd/kv-read/EndToEnd              252 ± 1%       250 ± 0%    -0.99%  (p=0.029 n=4+4)
EndToEnd/kv-read-no-prep/EndToEnd      332 ± 0%       330 ± 1%      ~     (p=0.229 n=4+4)
EndToEnd/kv-read-const/EndToEnd        214 ± 0%       212 ± 0%    -0.93%  (p=0.029 n=4+4)
```

Release note: None

56243: liveness: introduce GetLivenessesFromKV r=irfansharif a=irfansharif

Now that we always create a liveness record on start up (#53805), we can simply fetch all records from KV when wanting an up-to-date view of all nodes that have ever been a part of the cluster. We add a helper to do as much, which we'll rely on when introducing long running migrations (#56107). It's a bit unfortunate that we're further adding on to the liveness API without changing the underlying look-aside cache structure, but fetching up-to-date records directly from KV is the world we're hoping to start moving towards over time. The TODO added in [1] outlines what the future holds.

We'll also expose the GetLivenessesFromKV API we introduced earlier to pkg/sql. We'll rely on it when needing to plumb the liveness instance into the migration manager process (prototyped in #56107). It should be noted that this will be a relatively meatier form of a dependency on node liveness from pkg/sql than we have currently. Today the only uses are in DistSQL planning and in jobs [2]. As it relates to our multi-tenancy work, the real use of this API will happen only on the system tenant. System tenants alone have the privilege to set cluster settings (or at least the version setting specifically), which is what the migration manager will be wired into.

[1]: d631239
[2]: #48795

Release note: None

---

First commit is from #56221, and can be ignored here.

Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Previously it used to be the case that it was possible for a node to be
up and running, and for there to be no corresponding liveness record for
it. This was a very transient situation as liveness records are created
for a given node as soon as it sends out its first heartbeat. Still,
given that this could take a few seconds, it lent itself to a lot of
complexity in
our handling of node liveness where we had to always anticipate the
possibility of there being no corresponding liveness record for a given
node (and thus creating it if necessary).
It was originally thought we'd only be able to guarantee persisting a
liveness record for a joining node if we had infrastructure like the
join RPC, introduced in #52526, but staring at it a bit harder we can
simply ensure we heartbeat our liveness record at least once before
fully spinning up our server (aka opening the RPC floodgates).
Having a liveness record for each node always present is a crucial
building block for long running migrations (#48843). There the intention
is to have the orchestrator process look towards the list of liveness
records for an authoritative view of cluster membership. Previously when
it was possible for an active member of the cluster to not have a
corresponding liveness record (no matter how unlikely or short-lived in
practice), we could not generate such a view.
Release justification: low risk, high benefit changes to existing functionality
Release note: None
First commit is from #53713, both are necessary to remove the hack in the decommissioning roachtest.