
server: add SQL checks to the /health?ready=1 probe #59191

Draft · wants to merge 1 commit into master

Conversation

@knz (Contributor) commented Jan 20, 2021

Fixes #58864.

Previously, it was possible for the /health?ready=1 probe
to return "OK" when the SQL service was unavailable.

This patch changes this by performing some crude, basic SQL checks on
system tables as part of the check.

Release note (api change): the health API now includes basic
SQL checks as part of readiness checks.
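
To make the shape of the change concrete, here is a minimal sketch of such a check. The `RunLocalSQL` / `sql.InternalExecutor` plumbing and the `codes.Unavailable` error are taken from the snippets quoted in the review below; the specific statement and the helper name `checkSQLReadiness` are assumptions, since the PR only commits to "crude, basic SQL checks on system tables".

```go
// Sketch only: roughly the shape of the readiness check this PR adds.
// The exact query is illustrative, not the one in the diff.
func (s *adminServer) checkSQLReadiness(ctx context.Context) error {
	// Bound the probe so an unavailable SQL layer fails fast instead of hanging.
	timeoutCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()
	if err := s.server.RunLocalSQL(timeoutCtx,
		func(ctx context.Context, ie *sql.InternalExecutor) error {
			// A cheap read on a small system table exercises SQL and KV end to end.
			_, err := ie.Exec(ctx, "health-check", nil /* txn */,
				`SELECT 1 FROM system.namespace LIMIT 1`)
			return err
		},
	); err != nil {
		return status.Errorf(codes.Unavailable, "SQL readiness check failed: %v", err)
	}
	return nil
}
```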

@cockroach-teamcity (Member) commented
This change is Reviewable

@knz (Contributor, Author) commented Jan 20, 2021

Airing this out for discussion: what kind of SQL checks do we actually want in there?

`)
return err
}); err != nil {
// TODO(knz): When always-on tracing is available via SQL, we may
@tbg (Member) left a comment

You don't need always-on tracing here - you can just trace it verbosely, period. The main reason not to is that we shouldn't verbosely trace anywhere until we have proper memory accounting, see https://github.com/cockroachdb/cockroach/projects/57#card-53177248

@ajwerner (Contributor) left a comment



pkg/server/admin.go, line 1410 at r1 (raw file):

	timeoutCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()
	if err := s.server.RunLocalSQL(timeoutCtx, func(ctx context.Context, ie *sql.InternalExecutor) error {

I'm skeptical of this choice. In a geodistributed cluster, this may now take many hundreds of milliseconds.

I'm content with just the accepting clients bool as fixing this issue. I'm open to something additional but I don't think it should require issuing any RPCs let alone many.

// unavailable ranges translate to a health failure.
timeoutCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
defer cancel()
if err := s.server.RunLocalSQL(timeoutCtx, func(ctx context.Context, ie *sql.InternalExecutor) error {
@tbg (Member) left a comment

Don't we need some form of rate limiting here to ensure that this code doesn't trigger at a high rate? The health endpoint is the most accessible endpoint in the system.
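
One way to bound the cost, sketched below under the assumption that the check stays inline rather than moving to a background task: cache the result and refresh it at most once per interval, so arbitrarily frequent polling of `/health?ready=1` translates into a bounded rate of SQL work. The type and names here are hypothetical, not from the PR.

```go
// cachedCheck re-runs an expensive check at most once per interval and
// otherwise serves the previously observed result.
type cachedCheck struct {
	mu       sync.Mutex
	interval time.Duration
	lastRun  time.Time
	lastErr  error
}

func (c *cachedCheck) check(ctx context.Context, run func(context.Context) error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.lastRun.IsZero() && time.Since(c.lastRun) < c.interval {
		return c.lastErr // cached result; no SQL issued for this probe
	}
	// Holding the mutex across run keeps the sketch simple; it also
	// serializes concurrent probes so only one of them pays for the check.
	c.lastRun = time.Now()
	c.lastErr = run(ctx)
	return c.lastErr
}
```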

@knz (Contributor, Author) left a comment

Thanks for the initial comments. Please keep them coming!



pkg/server/admin.go, line 1410 at r1 (raw file):

Previously, ajwerner wrote…

I'm skeptical of this choice. In a geodistributed cluster, this may now take many hundreds of milliseconds.

I'm content with just the accepting clients bool as fixing this issue. I'm open to something additional but I don't think it should require issuing any RPCs let alone many.

In a geo-distributed cluster, if looking at these system tables is taking "many hundreds of milliseconds" with AS OF SYSTEM TIME 5 seconds in the past, IMHO the cluster is not healthy. There'd be a need for a custom zone config in that case.
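
To make the AOST idea concrete, the stale read could look something like the following. This is illustrative only; the actual statement and interval in the diff may differ.

```go
// Reading at a timestamp a few seconds in the past allows the read to be
// served by a nearby follower replica where follower reads apply, instead
// of forcing a round trip to the leaseholder.
const readinessQuery = `SELECT 1 FROM system.namespace AS OF SYSTEM TIME '-5s' LIMIT 1`
```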


pkg/server/admin.go, line 1410 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Don't we need some form of rate limiting here to ensure that this code doesn't trigger at a high rate? The health endpoint is the most accessible endpoint in the system.

Good point! Will need to add that indeed.


pkg/server/admin.go, line 1431 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You don't need always-on tracing here - you can just trace it verbosely, period. The main reason not to is that we shouldn't verbosely trace anywhere until we have proper memory accounting, see https://github.com/cockroachdb/cockroach/projects/57#card-53177248

Good point. Will include this in the comment.

@ajwerner (Contributor) left a comment



pkg/server/admin.go, line 1410 at r1 (raw file):

Previously, knz (kena) wrote…

In a geo-distributed cluster, if looking at these system tables is taking "many hundreds of milliseconds" with AS OF SYSTEM TIME 5 seconds in the past, IMHO the cluster is not healthy. There'd be a need for a custom zone config in that case.

Distance from singapore to dc         >= 15,528 km
Speed of light in a fiber optic cable = 199861.2127778 km/s

Network RTT >= 2 * ((15,528 km) / (199861.2127778 km/s))
            >= 155.387 ms

There are a number of very valid reasons why this may not occur perfectly in parallel, not the least of which is that we always resolve a bunch of system tables using system.namespace. Zone configs can't save you here.

@ajwerner (Contributor) left a comment



pkg/server/admin.go, line 1410 at r1 (raw file):

Previously, ajwerner wrote…
Distance from singapore to dc         >= 15,528 km
Speed of light in a fiber optic cable = 199861.2127778 km/s

Network RTT >= 2 * ((15,528 km) / (199861.2127778 km/s))
            >= 155.387 ms

There are a number of very valid reasons why this may not occur perfectly in parallel, not the least of which is that we always resolve a bunch of system tables using system.namespace. Zone configs can't save you here.

Also, follower reads only exist on enterprise clusters and depend heavily on cluster settings.

@knz (Contributor, Author) commented Jan 20, 2021

we always resolve a bunch of system tables using system.namespace. Zone configs can't save you here.

If I have one replica per region, then the AOST clause should ensure the nearest replica is hit and I never pay cross-continent latency?

@ajwerner (Contributor) left a comment

Even if we assumed follower reads worked, which they may not, and that the historical timestamp was far enough in the past to use it, there are valid reasons to constrain your system tables to, say, the US or Europe even if you have nodes on different continents. Lastly, there are cases where you may need to go read the meta ranges, which will be slow.

Perhaps that's actually a valid reason not to report ready on this endpoint, and ideally startup warms the range cache reasonably.


@knz (Contributor, Author) commented Jan 20, 2021

Discussed with the KV pod: Tobias proposes to run the SQL query periodically (e.g. every second or 5 seconds), store its result in a global boolean, and read that boolean during the health probe.

This way we'd have a "SQL prober" task running continuously, and we could even afford adding additional checks into it over time, like Josh was doing in his other PR.
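
A rough sketch of that prober shape, using only standard-library pieces; all names here are hypothetical and not from the PR:

```go
// sqlProber runs a health check on a fixed interval in the background and
// caches the result, so /health?ready=1 only reads a boolean and never
// triggers SQL work itself.
type sqlProber struct {
	healthy atomic.Value // stores bool; unset until the first probe completes
}

func (p *sqlProber) run(ctx context.Context, interval time.Duration, check func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			err := check(ctx)
			p.healthy.Store(err == nil)
			// err could also be logged here, giving periodic SQL health
			// reports even when nothing is polling the health URL.
		case <-ctx.Done():
			return
		}
	}
}

// isHealthy is what the readiness handler would consult.
func (p *sqlProber) isHealthy() bool {
	v, ok := p.healthy.Load().(bool)
	return ok && v
}
```

The readiness handler would then consult `isHealthy` alongside the existing `acceptingClients` check instead of issuing SQL inline.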

@knz (Contributor, Author) commented Jan 20, 2021

Another advantage: this way we can get periodic health reports on SQL in the logs, even when no tool is currently polling the health URL.

@ajwerner (Contributor) commented

I'm pleased with that.

@@ -1397,6 +1397,43 @@ func (s *adminServer) checkReadinessForHealthCheck() error {
return status.Errorf(codes.Unavailable, "node is shutting down")
}

if !s.server.sqlServer.acceptingClients.Get() {
@joshimhoff (Collaborator) left a comment

CC @ajwerner. Would just the accepting clients check have been enough to detect the outage from last week caused by the rogue SQL migration (not sure that's actually the right language)?

@ajwerner (Contributor) left a comment

Yes, it would have. We run the startup migrations before this point.

return status.Errorf(codes.Unavailable, "node is not accepting SQL clients")
}

// Also check the availability of the SQL service.
@joshimhoff (Collaborator) left a comment

CC @chrisseto to weigh in!

In CC, we use /health?ready=1 as a readiness probe (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). This probe is used in two places (I think just these two places):

  1. When doing upgrades, we roll a single node over, then wait until all nodes in the cluster are ready, meaning the probe passes on all nodes. Then we continue to another node.
  2. If a node starts failing the readiness probe, k8s will stop the node from "receiving traffic through Kubernetes Services" (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes). It's actually not clear to me what this means with a stateful protocol like SQL; are existing connections dropped on the floor? I wonder if @chrisseto knows.

Re: 1, querying these system tables seems like a useful thing to do. We will halt bad upgrades earlier.

Re: 2, I am not so sure. All these system tables become critical dependencies. If one of them starts having problems, then our k8s setup may lead to a big drop in capacity, as all nodes will detect this at the same time. This may make the actual problem worse. The feedback loop this introduces worries me.

What do folks think about this? There is no fundamental reason that we need to use the same probe for 1 & 2 btw. OTOH, k8s may make certain assumptions that make it somewhat finicky not to. Again @chrisseto to comment on that.
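
For reference, from the orchestrator's side the readiness probe is just an HTTP GET whose status code is the signal. A minimal client-side version, hypothetical code that is not part of this PR, might look like:

```go
// nodeIsReady queries a node's readiness probe the way an orchestrator does:
// GET /health?ready=1 and treat any non-200 status (or any error) as "not ready".
// The base URL and timeout are illustrative.
func nodeIsReady(ctx context.Context, baseURL string) bool {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/health?ready=1", nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```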

@chrisseto (Contributor) left a comment

it's actually not clear to me what this means with a stateful protocol like SQL; are existing connections dropped on the floor?

Existing connections will not be dropped. The pod will just not receive any new connections from the load balancer. Though it gets a bit weird, as other nodes will still be able to discover this one (our service-discovery Service doesn't require readiness). So it would be possible for a node to be executing DistSQL queries (can someone fact-check me here?).

There is no fundamental reason that we need to use the same probe for 1 & 2 btw. OTOH, k8s may make certain assumptions that make it somewhat finicky not to.

Can you say more about this? AFAIK they have to be the same probe, and they should be the same probe from the k8s perspective. Do you mean that we could have a manual probe for the rolling upgrade?

My main question is whether or not this check adds a dependency on other nodes being live/ready. If I have an 8 node cluster and only 3 nodes are up, will all nodes become unready?

Our StatefulSets are set up to manage all pods in parallel, and nodes will be able to connect to unready nodes, so I don't think this will have any unintended consequences.

@joshimhoff (Collaborator) left a comment

Existing connections will not be dropped.

Very interesting!

return status.Errorf(codes.Unavailable, "node is not accepting SQL clients")
}

// Also check the availability of the SQL service.
@joshimhoff (Collaborator) left a comment

An alternative design to consider:

  1. Here we do a basic minimum check: `!s.server.sqlServer.acceptingClients.Get()`. If the probe fails, we halt the upgrade.
  2. In the CC / k8s operator upgrade automation, we check time-series metrics during the upgrade. If the metrics look bad, e.g. if they look different between the updated node & a node running the old version, we halt the upgrade.

Why check metrics? Well, the aggregations we can do in Prometheus may give a higher signal-to-noise ratio than making a local decision on the node itself. Also, maybe it's more extensible? When we get access to new & rich time-series data, it's easy to integrate into the upgrade process.

Lastly, it sidesteps the worries in my above comment, as (2) won't affect k8s load-balancing decisions by design.

@joshimhoff (Collaborator) commented

Also yay to this & thank you for getting to it so fast!

@tbg (Member) commented Jan 21, 2021

@joshimhoff brings up the good point that we don't rigorously define "health". I think there are three:

  • health: process is running well enough to respond (or some other very basic check)
  • health?ready=1: health and the node is ready to serve traffic (i.e.: you can put it in your lb)
  • health?cluster=1: from the point of view of this node, the cluster is healthy.

The third one ties in more clearly with Josh's kvprober side project, and the queries Raphael added make sense in that context as well.
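
A hypothetical sketch of that three-level split is below. Only `/health` and `/health?ready=1` exist today; the `cluster=1` variant is the proposal above, and the three check functions are placeholders assumed purely for illustration.

```go
// Placeholder checks, wired up elsewhere; the names are made up for this sketch.
var processHealthy, nodeReady, clusterHealthyFromHere func(ctx context.Context) error

func handleHealth(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	var err error
	switch {
	case q.Get("cluster") != "":
		// From the point of view of this node, is the cluster healthy?
		err = clusterHealthyFromHere(r.Context())
	case q.Get("ready") != "":
		// Is this node ready to serve traffic (safe to keep behind a load balancer)?
		err = nodeReady(r.Context())
	default:
		// Is the process running well enough to respond at all?
		err = processHealthy(r.Context())
	}
	if err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```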

@joshimhoff (Collaborator) commented

I like that frame, @tbg! I have mixed feelings about the idea of deciding if a cluster is healthy from a single node (it might make more sense to make such a decision from a control plane component), but so long as health?ready=1 doesn't make such a decision, I have no concerns about causing surprising production behavior in CC or on-prem k8s deployments.

@knz (Contributor, Author) commented Jan 21, 2021

Hm actually my PR is attempting to make another determination, separate from the 3 concepts Tobias presented already: whether the node is effectively connected to the rest of the cluster.

That is, it's not partitioned away.

@tbg (Member) commented Jan 22, 2021

Hm actually my PR is attempting to make another determination, separate from the 3 concepts Tobias presented already: whether the node is effectively connected to the rest of the cluster.

That is, it's not partitioned away.

Note that in health?cluster=1 I am saying "from the point of view of this node, the cluster is healthy". For a node to perceive the cluster as healthy, it needs to be connected to the cluster.

I have mixed feelings about the idea of deciding if a cluster is healthy from a single node

Right, and that too is acknowledged in the "from the point of view of this node" part. If you want the cluster to be healthy, on top of other checks, it should look healthy from all of the nodes, which the control plane could keep an eye on.

@knz (Contributor, Author) commented Jan 23, 2021

Ok I'm going to split this PR in two:

  • one that just adds a check for the SQL listener, which is the case that Andrew and Josh cared about initially. This will be more compact and carry less risk (and can be backported).
  • this one, where we're going to add the async checks discussed earlier in the review.

craig bot pushed a commit that referenced this pull request Jan 25, 2021
59146: bazel: generate `//go:generate stringer` files sandbox r=rickystewart a=alan-mas

First part of #57787 work.

As `gazelle` does not by itself take care of changes related to `//go:generate stringer`, we are creating a `genrule` to handle it.

Of all the `//go:generate stringer` files, there are some that have trouble during the bazel build:
```
-pkg/util/encoding/encoding.go
-pkg/util/encoding/BUILD.bazel
"stringer: can't handle non-integer constant type Type"

-pkg/workload/schemachange/schemachange.go
-pkg/workload/schemachange/BUILD.bazel
"stringer: can't happen: constant is not an integer TxStatusInFailure"
```

Release note: None

59327: build,bazel: inject nodejs support into our bazel setup r=rickystewart a=rickystewart

Use the standard Google-supported Bazel tools to do so. This isn't
plugged into anything yet, but you can verify that it works with `bazel
query @npm//...`.

Release note: None

59350: server: wait for SQL readiness in the `/health?ready=1` probe r=joshimhoff a=knz

(smaller version than #59191) 
Fixes #58864.
Planning to backport to v20.2.

Release note (api change): the health API now checks that the SQL
server is ready to accept clients when a readiness check is requested.

Co-authored-by: Alanmas <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@irfansharif irfansharif removed their request for review February 11, 2021 18:13

Successfully merging this pull request may close these issues.

server: need to change /health?ready=1 to returns success only after the node can serve SQL