server: need to change /health?ready=1 to return success only after the node can serve SQL
#58864
Comments
Hi @ajwerner, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
@knz do you have thoughts here? cc @joshimhoff
Ya this seems like one of the key action items definitely. Thanks, Andrew!
Ok, that is an enhancement request then? What's the urgency? @thtruo, can you figure out with the SRE team when we need this and whether it needs to be pulled into the next server milestone? Thanks.
(The issue was retitled from "/health?ready=1 returns success before the node can serve SQL" to "server: need to change /health?ready=1 to return success only after the node can serve SQL".)
Perhaps we should treat it as a bug. I know that it would make our upgrade rollout more robust to failures during startup.
Works for me, feel free to retitle and relabel. I'll still need to get some input on priority though.
I was tracing through this Slack thread and got the sense that the impact was high - that was a bad outage. I'm on board with pulling this issue into the next milestone to help prevent outages like this from happening again. cc @joshimhoff let us know if you have additional input to share around priority/urgency.
Next milestone obvi good with me :) but my ask would be that we get this into some 20.2 patch release ahead of 21.1 being released to CC. Seems to me a major version bump is the most likely time to hit an issue on upgrade that causes SQL unavailability, since migrations are run. (Correct me if wrong plz.) And the specific issue that we hit this week is fixed in 20.2.3, which is now released to CC. This is def a REALLY great improvement; it takes a whole class of bugs from causing 1+ hour long global outages [1] to zero data-plane impact (just an upgrade failure)!! These are the kind of action items we want to pull out of incidents; executing on many such action items greatly improves reliability in aggregate. [1] 1+ hour long global outages threaten a 99.9% SLO (2 hours down a quarter... quite a weak SLO for a DBaaS)
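(For reference, a 99.9% availability target over a roughly 91-day quarter allows about 0.001 × 91 × 24 ≈ 2.2 hours of downtime, so a single 1+ hour global outage consumes about half of that quarterly budget.)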
Is your feature request related to a problem? Please describe.
The /health?ready=1 HTTP endpoint returns success before we've started to accept SQL connections. It seems to just be checking that the node is live and that gRPC is operational. This can be problematic if the server gets caught in startup migrations.
Describe the solution you'd like
Wait to set the readiness bit until the server is actually ready to serve requests.
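To make the idea concrete, here is a minimal sketch of that readiness gate (not CockroachDB's actual server code; the types and names are invented, and it assumes Go 1.19+ for atomic.Bool). The handler keeps answering plain /health as soon as the process is up, but only reports success for ?ready=1 after a flag set at the end of startup:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// server is a stand-in for the node's HTTP admin server, with a single
// readiness flag that is flipped only once SQL can actually be served.
type server struct {
	sqlReady atomic.Bool
}

func (s *server) handleHealth(w http.ResponseWriter, r *http.Request) {
	// Plain /health answers as soon as the process is up (liveness).
	// /health?ready=1 additionally requires SQL to be available (readiness).
	if r.URL.Query().Get("ready") == "1" && !s.sqlReady.Load() {
		http.Error(w, "node is not yet ready to serve SQL", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	s := &server{}
	http.HandleFunc("/health", s.handleHealth)

	go func() {
		// Stand-in for the real startup sequence: run startup migrations,
		// start accepting SQL connections, and only then flip the flag.
		time.Sleep(5 * time.Second)
		s.sqlReady.Store(true)
	}()

	_ = http.ListenAndServe("localhost:8080", nil)
}
```

The important property is that the flag flips only after startup work (including migrations) has finished and SQL connections can be accepted, so anything polling /health?ready=1 cannot conclude the node is ready while it is still starting up.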
Describe alternatives you've considered
Add another param?
Additional context
This has been making automated upgrades seem like they are going well even though they are actually hung. For example, in some cases pre-20.2.3 releases in the 20.2 series are susceptible to #57437.
There are also reports that cockroach nodes advertise availability too early and end up creating latency spikes upon upgrade/restart.
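To illustrate why this matters for automated upgrades, here is a hypothetical sketch (not Cockroach Cloud's actual automation; the address, timeout, and poll interval are invented) of the kind of wait loop rollout tooling runs between node restarts. If /health?ready=1 returns 200 while the node is still hung in startup migrations, this loop reports success and the rollout moves on to the next node:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitUntilReady polls a node's readiness endpoint until it returns 200 OK
// or the deadline expires. If ?ready=1 reports success while the node is
// still stuck in startup migrations, this returns nil and the rollout
// proceeds even though the node cannot serve SQL.
func waitUntilReady(healthURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(healthURL)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("node did not become ready within %s", timeout)
}

func main() {
	// Illustrative address and timeout only.
	if err := waitUntilReady("http://localhost:8080/health?ready=1", 5*time.Minute); err != nil {
		fmt.Println("halting rollout:", err)
		return
	}
	fmt.Println("node reports ready, moving on to the next node")
}
```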