server: need to change /health?ready=1 to return success only after the node can serve SQL
#58864
Comments
Hi @ajwerner, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
@knz do you have thoughts here? cc @joshimhoff
Ya this seems like one of the key action items definitely. Thanks, Andrew!
Ok, that is an enhancement request then? What's the urgency? @thtruo, can you figure out with the SRE team when we need this and whether it needs to be pulled into the next server milestone? Thanks.
(The issue was retitled from "/health?ready=1 returns success before the node can serve SQL" to "server: need to change /health?ready=1 to return success only after the node can serve SQL".)
Perhaps we should treat it as a bug. I know that it would make our upgrade rollout more robust to failures during startup.
Works for me, feel free to retitle and relabel. I'll still need to get some input on priority though.
I was tracing through this Slack thread and got the sense that the impact was high - that was a bad outage. I'm on board with pulling this issue into the next milestone to help prevent outages like this from happening again. cc @joshimhoff let us know if you have additional input to share around priority/urgency.
Next milestone obvi good with me :) but my ask would be that we get this into some 20.2 patch release ahead of 21.1 being released to CC. Seems to me a major version bump is the most likely time to hit an issue on upgrade that causes SQL unavailability, since migrations are run. (Correct me if wrong plz.) And the specific issue that we hit this week is fixed in 20.2.3, which is now released to CC. This is def a REALLY great improvement; it takes a whole class of bugs from causing 1+ hour long global outages [1] to zero data-plane impact (just an upgrade failure)!! These are the kind of action items we want to pull out of incidents; executing on many such action items greatly improves reliability in aggregate. [1] 1+ hour long global outages threaten a 99.9% SLO (2 hours down a quarter... quite a weak SLO for a DBaaS)
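(For reference, a 99.9% availability target over a roughly 91-day quarter allows about 0.001 × 91 × 24 ≈ 2.2 hours of downtime, so a single 1+ hour global outage consumes about half of that quarterly budget.)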
Is your feature request related to a problem? Please describe.
The /health?ready=1 HTTP endpoint returns success before we've started to accept SQL connections. It seems to just be checking that the node is live and that gRPC is operational. This can be problematic if the server gets caught in startup migrations.
Describe the solution you'd like
Wait to set the readiness bit until the server is actually ready to serve requests.
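To make the idea concrete, here is a minimal sketch of that readiness gate (not CockroachDB's actual server code; the types and names are invented, and it assumes Go 1.19+ for atomic.Bool). The handler keeps answering plain /health as soon as the process is up, but only reports success for ?ready=1 after a flag set at the end of startup:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// server is a stand-in for the node's HTTP admin server, with a single
// readiness flag that is flipped only once SQL can actually be served.
type server struct {
	sqlReady atomic.Bool
}

func (s *server) handleHealth(w http.ResponseWriter, r *http.Request) {
	// Plain /health answers as soon as the process is up (liveness).
	// /health?ready=1 additionally requires SQL to be available (readiness).
	if r.URL.Query().Get("ready") == "1" && !s.sqlReady.Load() {
		http.Error(w, "node is not yet ready to serve SQL", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	s := &server{}
	http.HandleFunc("/health", s.handleHealth)

	go func() {
		// Stand-in for the real startup sequence: run startup migrations,
		// start accepting SQL connections, and only then flip the flag.
		time.Sleep(5 * time.Second)
		s.sqlReady.Store(true)
	}()

	_ = http.ListenAndServe("localhost:8080", nil)
}
```

The important property is that the flag flips only after startup work (including migrations) has finished and SQL connections can be accepted, so anything polling /health?ready=1 cannot conclude the node is ready while it is still starting up.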
Describe alternatives you've considered
Add another param?
Additional context
This has been making automated upgrades seem like they are going well even though they are actually hung. For example, in some cases pre-20.2.3 releases in the 20.2 series are susceptible to #57437.
There are also reports that cockroach nodes advertise availability too early and end up creating latency spikes upon upgrade/restart.
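To illustrate why this matters for automated upgrades, here is a hypothetical sketch (not Cockroach Cloud's actual automation; the address, timeout, and poll interval are invented) of the kind of wait loop rollout tooling runs between node restarts. If /health?ready=1 returns 200 while the node is still hung in startup migrations, this loop reports success and the rollout moves on to the next node:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitUntilReady polls a node's readiness endpoint until it returns 200 OK
// or the deadline expires. If ?ready=1 reports success while the node is
// still stuck in startup migrations, this returns nil and the rollout
// proceeds even though the node cannot serve SQL.
func waitUntilReady(healthURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(healthURL)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("node did not become ready within %s", timeout)
}

func main() {
	// Illustrative address and timeout only.
	if err := waitUntilReady("http://localhost:8080/health?ready=1", 5*time.Minute); err != nil {
		fmt.Println("halting rollout:", err)
		return
	}
	fmt.Println("node reports ready, moving on to the next node")
}
```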