
server: always create a liveness record before starting up #53805

Closed

Conversation


@irfansharif irfansharif commented Sep 2, 2020

Previously, it was possible for a node to be up and running with no
corresponding liveness record for it. This was a very transient
situation, as liveness records are created for a given node as soon as
it sends out its first heartbeat. Still, given that this could take a
few seconds, it led to a lot of complexity in our handling of node
liveness, where we always had to anticipate the possibility of there
being no liveness record for a given node (and create it if necessary).

It was originally thought we'd only be able to guarantee persisting a
liveness record for a joining node if we had infrastructure like the
join RPC introduced in #52526, but staring at it a bit harder, we can
simply ensure we heartbeat our liveness record at least once before
fully spinning up our server (i.e. before opening the RPC floodgates).

Always having a liveness record present for each node is a crucial
building block for long-running migrations (#48843). There the intention
is to have the orchestrator process look to the list of liveness records
for an authoritative view of cluster membership. Previously, when it was
possible for an active member of the cluster to not have a corresponding
liveness record (no matter how unlikely or short-lived in practice), we
could not generate such a view.

Release justification: low risk, high benefit changes to existing functionality
Release note: None


First commit is from #53713; both are necessary to remove the hack in the decommissioning roachtest.

Deferred doing this in cockroachdb#52526. Probably a good idea to have
it; it'll bring down the cluster convergence time (the time taken for
all nodes to find out about the initialization) by a bit.

Release justification: low risk, high benefit changes to existing functionality
Release note: None
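
For illustration, here is a minimal sketch of the startup ordering the description argues for. The `liveness` and `rpcGate` types and the `startNode` function are hypothetical stand-ins, not the actual server code:

```
// A minimal sketch (not the actual CockroachDB server code) of the
// ordering described above: heartbeat once so the liveness record exists,
// and only then open the RPC floodgates.
package server

import "context"

type liveness interface {
	// Heartbeat writes/refreshes this node's liveness record, creating it
	// if it doesn't exist yet.
	Heartbeat(ctx context.Context) error
}

type rpcGate interface {
	// SetModeOperational starts admitting external RPCs.
	SetModeOperational()
}

// startNode ensures a liveness record is present before the node starts
// serving, so the rest of the system never has to handle a running node
// without one.
func startNode(ctx context.Context, nl liveness, gate rpcGate) error {
	if err := nl.Heartbeat(ctx); err != nil {
		return err
	}
	gate.SetModeOperational()
	return nil
}
```

As the discussion below notes, this exact ordering runs into trouble when the node itself is bootstrapping the cluster (its gRPC server isn't serving yet), which is part of why this PR was ultimately abandoned in favor of #53842.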
@irfansharif irfansharif requested a review from tbg September 2, 2020 01:12


irfansharif commented Sep 2, 2020

@tbg, I'm a bit confused. I'm not sure why this wasn't possible before. The only thing I can think of that would make us not want to do this is the following scenario:

  • n1-n3 running v21.1, initialized and running
  • n4, running v21.1 joins the cluster, gets a node ID, persists it to disk
  • n4 crashes before creating its liveness record
  • operator is able to roll n1-n3 to v21.2 and finalize upgrade because n4's liveness record is not present
  • n4 is never decommissioned, operator never needed to (also it wasn't really possible, even before this PR, seeing as how n4 crashed before being able to create a liveness record - it would always fail with ErrNoLivenessRecord)
  • n4 is started back up, still running v21.1, finds the cluster/node id on disk and now happily creates its liveness record/continues on merrily.
  • ...boom?

While shitty, this is also a bug that already exists today. I've actually seen it happen in https://github.com/cockroachlabs/support/issues/584, where a node was quickly added and removed before it was able to persist a liveness record for itself. That said, is there another reason why we shouldn't do what we're doing in this PR?

@irfansharif irfansharif force-pushed the 200901.heartbeat-early branch from c8d2132 to ecc0abf September 2, 2020 03:43

irfansharif commented Sep 2, 2020

I'm not sure why I'm seeing gossip/locality-address fail with my changes. I suspect it's due to #42741, but I could be wrong.

Edit: hm, looks like it's unable to send itself batch requests (needed to heartbeat) given it's not yet operational. For the following setup:

make buildshort
roachprod destroy local
roachprod create local -n 4
roachprod put local ./cockroach ./cockroach

This works:

roachprod start local:1

But this doesn't:

roachprod start --racks=1 --args=--locality-advertise-addr=rack=0@localhost local:1

Is there some grpc magic "don't send an external rpc if headed for 127.0.0.1" that's coming into play here? TIL. I see that the following also works:

roachprod start --racks=1 --args=--locality-advertise-addr=rack=0@127.0.0.1 local:1

// Begin the node liveness heartbeat. Add a callback which records the
// local store "last up" timestamp for every store whenever the liveness
// record is updated. We're sure to do this before we open RPC
// floodgates.

Not guaranteed since it's async, so add that it's best effort

// local store "last up" timestamp for every store whenever the liveness
// record is updated. We're sure to do this before we open RPC
// floodgates.
var livenessOnce sync.Once

nit: don't leak livenessOnce into main scope, do

livenessRecordCreated := make(chan struct{}, 1)
{
  // all the other stuff
}
if initialStart { ... }
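
For illustration, a sketch of the scoping being suggested here, assuming a hypothetical callback-registration hook; only `livenessRecordCreated` and `initialStart` come from the comment above:

```
// Hypothetical sketch: the sync.Once lives inside an inner block and only
// the channel escapes into the surrounding scope.
package server

import (
	"sync"
	"time"
)

// onLivenessUpdate stands in for whatever hook fires when this node's
// liveness record is written or updated.
func waitForLivenessRecord(onLivenessUpdate func(func()), initialStart bool) {
	livenessRecordCreated := make(chan struct{}, 1)
	{
		var once sync.Once
		onLivenessUpdate(func() {
			once.Do(func() { livenessRecordCreated <- struct{}{} })
		})
		// ... all the other stuff (heartbeat loop, "last up" callback, etc.)
	}
	if initialStart {
		// Best effort only: don't block startup forever on the first heartbeat.
		select {
		case <-livenessRecordCreated:
		case <-time.After(10 * time.Second):
		}
	}
}
```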


tbg commented Sep 3, 2020

This PR is good and there's no good reason we haven't done this yet. But as you've found out, it also doesn't give us the invariant that we need, which is that the liveness record exists before the node can ever consider itself part of the cluster.

I don't really understand the problem you're seeing regarding "sending itself heartbeats". Until we're in modeOperational, the node cannot receive replicas, so it shouldn't ever be consulted for anything. Gossip should be active by this point, so the KV client should be able to route to the correct nodes. Can you post some more details?

Is there some grpc magic "don't send an external rpc if headed for 127.0.0.1" that's coming into play here?

Yes there is! See

func (ctx *Context) GetLocalInternalClientForAddr(

But if we rely on that for things working, there's a problem.

Ah, I understand now: the issue arises when this node is the one bootstrapping the local cluster. In that case, with the new code, we're trying to synchronously heartbeat the liveness record, but the gRPC server does not yet allow RPCs. Ok, makes sense. Add a comment to that effect and simply move the liveness record stuff past the modeOperational line. Since we are only doing best effort here, it makes little difference (except, of course, that it works!).
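
One plausible reading of this exchange, as a toy sketch (this is not the real `GetLocalInternalClientForAddr` implementation): if the dialed address matches the node's own advertised address, an in-process client is used and the not-yet-operational gRPC server never comes into play, which would explain why `localhost` and `127.0.0.1` behave differently above:

```
// Toy model only: an address-keyed shortcut to an in-process client.
package rpcsketch

type internalClient interface{}

type rpcContext struct {
	// advertisedAddr is the address this node registered for itself.
	advertisedAddr string
	// localClient is the in-process client used to bypass the network.
	localClient internalClient
}

// internalClientForAddr returns the in-process client when the target is
// (string-)equal to our own advertised address; otherwise the caller falls
// back to dialing the target over gRPC, which requires the remote node
// (possibly ourselves!) to already be serving RPCs.
func (c *rpcContext) internalClientForAddr(target string) (internalClient, bool) {
	if c.localClient != nil && target == c.advertisedAddr {
		return c.localClient, true
	}
	return nil, false
}
```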

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 9, 2020
Previously, it was possible for a node to be up and running with no
corresponding liveness record for it. This was a very transient
situation, as liveness records are created for a given node as soon as
it sends out its first heartbeat. Still, given that this could take a
few seconds, it led to a lot of complexity in our handling of node
liveness, where we always had to anticipate the possibility of there
being no liveness record for a given node (and create it if necessary).

Always having a liveness record present for each node is a crucial
building block for long-running migrations (cockroachdb#48843). There the intention
is to have the orchestrator process look to the list of liveness records
for an authoritative view of cluster membership. Previously, when it was
possible for an active member of the cluster to not have a corresponding
liveness record (no matter how unlikely or short-lived in practice), we
could not generate such a view.

---

This is an alternative implementation for cockroachdb#53805. Here we choose to
manually write the liveness record for the bootstrapping node when
writing initial cluster data. For all other nodes, we do it on the
server-side of the join RPC. We're also careful to do it in the legacy
codepath when joining a cluster through gossip.

Release note: None
@irfansharif

Abandoning this PR in favor of #53842.

@irfansharif irfansharif closed this Sep 9, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 10, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 10, 2020
craig bot pushed a commit that referenced this pull request Sep 10, 2020
53842: server: always create a liveness record before starting up r=irfansharif a=irfansharif


53994: cli: allow SQL commands to use password authn in more cases r=bdarnell,irfansharif,aaron-crl a=knz

First two commits from #53991.

Previously, SQL password authn was only allowed over TLS connections.

With this change, password authn is allowed regardless of whether the
connection uses TLS.

This is implemented by only asking for a password interactively the
first time the server complains that password authn has failed. This
way, no password is ever requested interactively if the server
"trusts" the connection (via HBA rules or `--insecure`).

Release justification: low risk, high benefit changes to existing functionality


54035: sql: emit more tracing events from the stats cache r=RaduBerinde a=RaduBerinde

The stats cache has various "slow" paths (where we need to query the
system table). These are currently only logged if verbosity is high.

This change switches to `VEvent` in most cases, so that these are
visible during tracing (including in statement diagnostics bundles).
This will allow us to diagnose slow planning times, e.g. due to the
stats cache getting full.

Release justification: low-risk change to existing functionality, high
potential benefit for debugging issues.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Sep 11, 2020
Previously it used to be the case that it was possible for a node to be
up and running, and for there to be no corresponding liveness record for
it. This was a very transient situation as liveness records are created
for a given node as soon as it out its first heartbeat. Still, given
that this could take a few seconds, it lent to a lot of complexity in
our handling of node liveness where we had to always anticipate the
possibility of there being no corresponding liveness record for a given
node (and thus creating it if necessary).

Having a liveness record for each node always present is a crucial
building block for long running migrations (cockroachdb#48843). There the intention
is to have the orchestrator process look towards the list of liveness
records for an authoritative view of cluster membership. Previously when
it was possible for an active member of the cluster to not have a
corresponding liveness record (no matter how unlikely or short-lived in
practice), we could not generate such a view.

---

This is an alternative implementation for cockroachdb#53805. Here we choose to
manually write the liveness record for the bootstrapping node when
writing initial cluster data. For all other nodes, we do it on the
server-side of the join RPC. We're also careful to do it in the legacy
codepath when joining a cluster through gossip.

Release note: None
@irfansharif irfansharif deleted the 200901.heartbeat-early branch September 18, 2020 01:04
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 22, 2020
… not just availability

Fixes cockroachdb#53515.

We should have the autoupgrade process look at the fully decommissioned
bit we added in cockroachdb#50329, instead of just looking at availability. It
would avoid the hazard described in cockroachdb#53515.

Previously the autoupgrade process was also looking at
NodeStatusLiveness, which we've since soured upon (see cockroachdb#50478). Now that
we always create a liveness record on start up (cockroachdb#53805), we can simply
fetch all liveness records from KV. We add a helper to do this, which
we'll also rely on in future PRs for other purposes. It's a bit
unfortunate that we're further adding on to the NodeLiveness API without
changing the caching structure, but methods that fetch records from KV
are the world we're hoping to move towards going forward.

Release note: None
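
A hypothetical sketch of the kind of check that commit describes: gating upgrade finalization on the fully-decommissioned membership bit rather than on availability. All type and field names here are made up for illustration:

```
// Sketch only: which nodes should block an upgrade from finalizing?
package upgradesketch

type membershipStatus int

const (
	membershipActive membershipStatus = iota
	membershipDecommissioning
	membershipDecommissioned // the "fully decommissioned" bit
)

type livenessRecord struct {
	nodeID     int32
	membership membershipStatus
	isLive     bool // availability; deliberately not consulted below
}

// nodesBlockingUpgrade returns every node that hasn't been fully
// decommissioned, whether or not it currently appears live. A node that is
// merely down still blocks finalization until it is decommissioned.
func nodesBlockingUpgrade(records []livenessRecord) []int32 {
	var blocking []int32
	for _, r := range records {
		if r.membership != membershipDecommissioned {
			blocking = append(blocking, r.nodeID)
		}
	}
	return blocking
}
```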
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 27, 2020
Now that we always create a liveness record on start up (cockroachdb#53805), we can
simply fetch all liveness records from KV when wanting an up-to-date
view of all nodes in the cluster. We add a helper to do as much,
which we'll rely on in future commits. It's a bit unfortunate that we're
further adding on to the NodeLiveness API without changing the
underlying look-aside caching structure, but methods that fetch records
from KV are the world we're hoping to start moving towards over time.

Release note: None
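
As an illustration of the helper's shape (not the actual `GetLivenessesFromKV` implementation), a self-contained sketch that scans a liveness key range directly from KV instead of consulting the node-local cache; the `kvScanner` interface, key strings, and record fields are assumptions:

```
// Sketch only: read every liveness record straight from KV.
package livenesssketch

import "context"

// kvScanner is a stand-in for whatever KV read API such a helper would use.
type kvScanner interface {
	// Scan invokes visit for every value stored in [startKey, endKey).
	Scan(ctx context.Context, startKey, endKey string, visit func(value []byte) error) error
}

type livenessRecord struct {
	NodeID int64
	Epoch  int64
	// ... expiration, membership status, etc.
}

// getLivenessesFromKV returns the liveness records of all nodes that have
// ever been part of the cluster, bypassing the look-aside cache so the view
// is authoritative and up to date.
func getLivenessesFromKV(
	ctx context.Context,
	kv kvScanner,
	decode func([]byte) (livenessRecord, error),
) ([]livenessRecord, error) {
	var out []livenessRecord
	err := kv.Scan(ctx, "/System/NodeLiveness", "/System/NodeLivenessMax",
		func(value []byte) error {
			rec, err := decode(value)
			if err != nil {
				return err
			}
			out = append(out, rec)
			return nil
		})
	return out, err
}
```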
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 27, 2020
To flesh out the migrations infrastructure (in future commits), we'll
need a handle on all the nodes present in the system. Now that we always
create a liveness record on start up (cockroachdb#53805), we can simply fetch all
liveness records from KV. We add a helper to do so.

It's a bit unfortunate that we're further adding on to the NodeLiveness
API without changing the caching structure, but methods that fetch
records from KV are the world we're hoping to move towards going forward.

This does mean that we're introducing a direct dependency on
NodeLiveness in the sql layer, and there are improvements to be made here
around interfaces delineating between "system tenant only" sql code and
everything else. Only system tenants have the privilege to set cluster
settings (or at least the version setting specifically), which is what
this API will look to power.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 29, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 29, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 3, 2020
Now that we always create a liveness record on start up (cockroachdb#53805), we can
simply fetch all records from KV when wanting an up-to-date view of all
nodes that have ever been a part of the cluster. We add a helper to do
as much, which we'll rely on when introducing long running migrations
(cockroachdb#56107).

It's a bit unfortunate that we're further adding on to the liveness API
without changing the underlying look-aside cache structure, but
up-to-date records fetched directly from KV are the world we're hoping
to start moving towards over time. The TODO added in [1] outlines what
the future holds.

[1]: cockroachdb@d631239

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Nov 3, 2020
craig bot pushed a commit that referenced this pull request Nov 3, 2020
56130: opt: updated normalization rules for folding is expressions with null predicate r=mgartner a=jayshrivastava

opt: updated normalization rules for folding is expressions with null predicate

Previously, for statements such as `SELECT (foo,bar) IS DISTINCT FROM NULL FROM a_table`,
the expression `(foo,bar) IS DISTINCT FROM NULL` would not be normalized to `true`.
Similarly, if `IS NOT DISTINCT FROM NULL` were used, then the expression would not be
normalized to `false`. Previously, such statements would only be normalized if the tuple/array
in the statement contained only constants. With the updates in this commit, normalization
is applied whenever any arrays or tuples are provided in this situation.

Release note: None

56217: sql: clean up uses of Statement r=RaduBerinde a=RaduBerinde

This change cleans up the use of `sql.Statement` and reduces some
allocations. Specifically:

 - we create a  `Statement` lower in the stack (in `execStmtInOpenState`),
   and pass only what we need in the higher layers;

 - we change various functions to take a `tree.Statement` rather than
   an entire `Statement` when possible;

 - we move down the `withStatement` context allocation, so that it is
   avoided in the implicit transaction state transition;

 - we store a copy rather than a pointer to the Statement in the
   planner;

 - we avoid directly using `stmt` fields from `func()` declarations
   that escape;

 - we populate `Statement.AnonymizedStr` upfront. The anonymized
   string is always needed (to update statement stats).

```
name                               old time/op    new time/op    delta
EndToEnd/kv-read/EndToEnd             153µs ± 1%     154µs ± 2%    ~     (p=0.486 n=4+4)
EndToEnd/kv-read-no-prep/EndToEnd     216µs ± 1%     217µs ± 1%    ~     (p=0.886 n=4+4)
EndToEnd/kv-read-const/EndToEnd       111µs ± 1%     113µs ± 1%  +1.01%  (p=0.029 n=4+4)

name                               old alloc/op   new alloc/op   delta
EndToEnd/kv-read/EndToEnd            25.8kB ± 1%    25.5kB ± 1%    ~     (p=0.114 n=4+4)
EndToEnd/kv-read-no-prep/EndToEnd    32.2kB ± 1%    31.9kB ± 1%    ~     (p=0.686 n=4+4)
EndToEnd/kv-read-const/EndToEnd      21.0kB ± 1%    20.7kB ± 2%    ~     (p=0.200 n=4+4)

name                               old allocs/op  new allocs/op  delta
EndToEnd/kv-read/EndToEnd               252 ± 1%       250 ± 0%  -0.99%  (p=0.029 n=4+4)
EndToEnd/kv-read-no-prep/EndToEnd       332 ± 0%       330 ± 1%    ~     (p=0.229 n=4+4)
EndToEnd/kv-read-const/EndToEnd         214 ± 0%       212 ± 0%  -0.93%  (p=0.029 n=4+4)
```

Release note: None

56243: liveness: introduce GetLivenessesFromKV r=irfansharif a=irfansharif

Now that we always create a liveness record on start up (#53805), we can simply
fetch all records from KV when wanting an up-to-date view of all nodes that
have ever been a part of the cluster. We add a helper to do as much, which
we'll rely on when introducing long running migrations (#56107).

It's a bit unfortunate that we're further adding on to the liveness API without
changing the underlying look-aside cache structure, but up-to-date records
fetched directly from KV are the world we're hoping to start moving towards
over time. The TODO added in [1] outlines what the future holds.

We'll also expose the GetLivenessesFromKV API we introduced earlier to pkg/sql.
We'll rely on it when plumbing the liveness instance into the migration manager
process (prototyped in #56107).

It should be noted that this will be a relatively meatier form of a dependency
on node liveness from pkg/sql than we have currently. Today the only uses are
in DistSQL planning and in jobs[2]. As it relates to our multi-tenancy work,
the real use of this API will happen only on the system tenant. System tenants
alone have the privilege to set cluster settings (or at least the version
setting specifically), which is what the migration manager will be wired into.

[1]: d631239
[2]: #48795

Release note: None

---

First commit is from #56221, and can be ignored here.


Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: irfan sharif <[email protected]>