
server: add SQL checks to the /health?ready=1 probe #59191

Draft · wants to merge 1 commit into master

Conversation

@knz (Contributor) commented Jan 20, 2021

Fixes #58864.

Previously, it was possible for the /health?ready=1 probe
to return "OK" when the SQL service was unavailable.

This patch changes this by performing some crude, basic SQL checks on
system tables as part of the check.

Release note (api change): the health API now includes basic
SQL checks as part of readiness checks.
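
To make the shape of the change concrete, here is a minimal sketch of such a check. The `RunLocalSQL` / `sql.InternalExecutor` plumbing and the `codes.Unavailable` error are taken from the snippets quoted in the review below; the specific statement and the helper name `checkSQLReadiness` are assumptions, since the PR only commits to "crude, basic SQL checks on system tables".

```go
// Sketch only: roughly the shape of the readiness check this PR adds.
// The exact query is illustrative, not the one in the diff.
func (s *adminServer) checkSQLReadiness(ctx context.Context) error {
	// Bound the probe so an unavailable SQL layer fails fast instead of hanging.
	timeoutCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()
	if err := s.server.RunLocalSQL(timeoutCtx,
		func(ctx context.Context, ie *sql.InternalExecutor) error {
			// A cheap read on a small system table exercises SQL and KV end to end.
			_, err := ie.Exec(ctx, "health-check", nil /* txn */,
				`SELECT 1 FROM system.namespace LIMIT 1`)
			return err
		},
	); err != nil {
		return status.Errorf(codes.Unavailable, "SQL readiness check failed: %v", err)
	}
	return nil
}
```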

@cockroach-teamcity (Member) commented
This change is Reviewable

@knz (Contributor, Author) commented Jan 20, 2021

Airing this out for discussion: what kind of SQL checks do we actually want in there?

`)
return err
}); err != nil {
// TODO(knz): When always-on tracing is available via SQL, we may
@tbg (Member) left a comment

You don't need always-on tracing here - you can just trace it verbosely, period. The main reason not to is that we shouldn't verbosely trace anywhere until we have proper memory accounting, see https://github.com/cockroachdb/cockroach/projects/57#card-53177248

@ajwerner (Contributor) left a comment



pkg/server/admin.go, line 1410 at r1 (raw file):

	timeoutCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
	defer cancel()
	if err := s.server.RunLocalSQL(timeoutCtx, func(ctx context.Context, ie *sql.InternalExecutor) error {

I'm skeptical of this choice. In a geodistributed cluster, this may now take many hundreds of milliseconds.

I'm content with just the accepting clients bool as fixing this issue. I'm open to something additional but I don't think it should require issuing any RPCs let alone many.

// unavailable ranges translate to a health failure.
timeoutCtx, cancel := context.WithTimeout(ctx, 1*time.Second)
defer cancel()
if err := s.server.RunLocalSQL(timeoutCtx, func(ctx context.Context, ie *sql.InternalExecutor) error {
@tbg (Member) left a comment

Don't we need some form of rate limiting here to ensure that this code doesn't trigger at a high rate? The health endpoint is the most accessible endpoint in the system.
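
One way to bound the cost, sketched below under the assumption that the check stays inline rather than moving to a background task: cache the result and refresh it at most once per interval, so arbitrarily frequent polling of `/health?ready=1` translates into a bounded rate of SQL work. The type and names here are hypothetical, not from the PR.

```go
// cachedCheck re-runs an expensive check at most once per interval and
// otherwise serves the previously observed result.
type cachedCheck struct {
	mu       sync.Mutex
	interval time.Duration
	lastRun  time.Time
	lastErr  error
}

func (c *cachedCheck) check(ctx context.Context, run func(context.Context) error) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.lastRun.IsZero() && time.Since(c.lastRun) < c.interval {
		return c.lastErr // cached result; no SQL issued for this probe
	}
	// Holding the mutex across run keeps the sketch simple; it also
	// serializes concurrent probes so only one of them pays for the check.
	c.lastRun = time.Now()
	c.lastErr = run(ctx)
	return c.lastErr
}
```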

@knz (Contributor, Author) left a comment

Thanks for the initial comments. Please keep them coming!



pkg/server/admin.go, line 1410 at r1 (raw file):

Previously, ajwerner wrote…

I'm skeptical of this choice. In a geodistributed cluster, this may now take many hundreds of milliseconds.

I'm content with just the accepting clients bool as fixing this issue. I'm open to something additional but I don't think it should require issuing any RPCs let alone many.

In a geo-distributed cluster, if looking at these system tables is taking "many hundreds of milliseconds" with AS OF SYSTEM TIME 5 seconds in the past, IMHO the cluster is not healthy. There'd be a need for a custom zone config in that case.
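
To make the AOST idea concrete, the stale read could look something like the following. This is illustrative only; the actual statement and interval in the diff may differ.

```go
// Reading at a timestamp a few seconds in the past allows the read to be
// served by a nearby follower replica where follower reads apply, instead
// of forcing a round trip to the leaseholder.
const readinessQuery = `SELECT 1 FROM system.namespace AS OF SYSTEM TIME '-5s' LIMIT 1`
```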


pkg/server/admin.go, line 1410 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Don't we need some form of rate limiting here to ensure that this code doesn't trigger at a high rate? The health endpoint is the most accessible endpoint in the system.

Good point! Will need to add that indeed.


pkg/server/admin.go, line 1431 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You don't need always-on tracing here - you can just trace it verbosely, period. The main reason not to is that we shouldn't verbosely trace anywhere until we have proper memory accounting, see https://github.com/cockroachdb/cockroach/projects/57#card-53177248

Good point. Will include this in the comment.

@ajwerner (Contributor) left a comment



pkg/server/admin.go, line 1410 at r1 (raw file):

Previously, knz (kena) wrote…

In a geo-distributed cluster, if looking at these system tables is taking "many hundreds of milliseconds" with AS OF SYSTEM TIME 5 seconds in the past, IMHO the cluster is not healthy. There'd be a need for a custom zone config in that case.

Distance from singapore to dc         >= 15,528 km
Speed of light in a fiber optic cable = 199861.2127778 km/s

Network RTT >= 2 * ((15,528 km) / (199861.2127778 km/s))
            >= 155.387 ms

There are a number of very valid reasons why this may not occur perfectly in parallel, not the least of which is that we always resolve a bunch of system tables using system.namespace. Zone configs can't save you here.

@ajwerner (Contributor) left a comment



pkg/server/admin.go, line 1410 at r1 (raw file):

Previously, ajwerner wrote…
Distance from singapore to dc         >= 15,528 km
Speed of light in a fiber optic cable = 199861.2127778 km/s

Network RTT >= 2 * ((15,528 km) / (199861.2127778 km/s))
            >= 155.387 ms

There are a number of very valid reasons why this may not occur perfectly in parallel, not the least of which is that we always resolve a bunch of system tables using system.namespace. Zone configs can't save you here.

Also, follower reads only exist on enterprise clusters and depend heavily on cluster settings.

@knz (Contributor, Author) commented Jan 20, 2021

we always resolve a bunch of system tables using system.namespace. Zone configs can't save you here.

If I have one replica per region, then the AOST clause should ensure the nearest replica is hit and I never pay cross-continent latency?

@ajwerner (Contributor) left a comment

Even if we assumed follower reads worked, which they may not, and that the historical timestamp was far enough in the past to use it, there are valid reasons to constrain your system tables to, say, the US or Europe even if you have nodes on different continents. Lastly, there are cases where you may need to go read the meta ranges, which will be slow.

Perhaps that's actually a valid reason not to report ready on this endpoint, and ideally startup warms the range cache reasonably.


@knz (Contributor, Author) commented Jan 20, 2021

Discussed with the KV pod: Tobias proposes to run the SQL query periodically (e.g. every second or 5 seconds), store its result in a global boolean, and read that boolean during the health probe.

This way we'd have a "SQL prober" task running continuously, and we could even afford adding additional checks into it over time, like Josh was doing in his other PR.
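
A rough sketch of that prober shape, using only standard-library pieces; all names here are hypothetical and not from the PR:

```go
// sqlProber runs a health check on a fixed interval in the background and
// caches the result, so /health?ready=1 only reads a boolean and never
// triggers SQL work itself.
type sqlProber struct {
	healthy atomic.Value // stores bool; unset until the first probe completes
}

func (p *sqlProber) run(ctx context.Context, interval time.Duration, check func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			err := check(ctx)
			p.healthy.Store(err == nil)
			// err could also be logged here, giving periodic SQL health
			// reports even when nothing is polling the health URL.
		case <-ctx.Done():
			return
		}
	}
}

// isHealthy is what the readiness handler would consult.
func (p *sqlProber) isHealthy() bool {
	v, ok := p.healthy.Load().(bool)
	return ok && v
}
```

The readiness handler would then consult `isHealthy` alongside the existing `acceptingClients` check instead of issuing SQL inline.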

@knz (Contributor, Author) commented Jan 20, 2021

Another advantage: this way we can get periodic health reports on SQL in the logs, even when no tool is currently polling the health URL.

@ajwerner (Contributor) commented

I'm pleased with that.

@@ -1397,6 +1397,43 @@ func (s *adminServer) checkReadinessForHealthCheck() error {
return status.Errorf(codes.Unavailable, "node is shutting down")
}

if !s.server.sqlServer.acceptingClients.Get() {
@joshimhoff (Collaborator) left a comment

CC @ajwerner. Would just the accepting clients check have been enough to detect the outage from last week caused by the rogue SQL migration (not sure that's actually the right language)?

@ajwerner (Contributor) left a comment

Yes, it would have. We run the startup migrations before this point.

return status.Errorf(codes.Unavailable, "node is not accepting SQL clients")
}

// Also check the availability of the SQL service.
@joshimhoff (Collaborator) left a comment

CC @chrisseto to weigh in!

In CC, we use /health?ready=1 as a readiness probe (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). This probe is used in two places (I think just these two places):

  1. When doing upgrades, we roll a single node over, then wait until all nodes in the cluster are ready, meaning the probe passes on all nodes. Then we continue to another node.
  2. If a node starts failing the readiness probe, k8s will stop the node from "receiving traffic through Kubernetes Services" (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes). It's actually not clear to me what this means with a stateful protocol like SQL; are existing connections dropped on the floor? I wonder if @chrisseto knows.

Re: 1, querying these system tables seems like a useful thing to do. We will halt bad upgrades earlier.

Re: 2, I am not so sure. All these system tables become critical dependencies. If one of them starts having problems, then our k8s setup may lead to a big drop in capacity, as all nodes will detect this at the same time. This may make the actual problem worse. The feedback loop this introduces worries me.

What do folks think about this? There is no fundamental reason that we need to use the same probe for 1 & 2 btw. OTOH, k8s may make certain assumptions that make it somewhat finicky not to. Again @chrisseto to comment on that.
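
For reference, from the orchestrator's side the readiness probe is just an HTTP GET whose status code is the signal. A minimal client-side version, hypothetical code that is not part of this PR, might look like:

```go
// nodeIsReady queries a node's readiness probe the way an orchestrator does:
// GET /health?ready=1 and treat any non-200 status (or any error) as "not ready".
// The base URL and timeout are illustrative.
func nodeIsReady(ctx context.Context, baseURL string) bool {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/health?ready=1", nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}
```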

@chrisseto (Contributor) left a comment

it's actually not clear to me what this means with a stateful protocol like SQL; are existing connections dropped on the floor?

Existing connections will not be dropped. The pod will just not receive any new connections from the load balancer. Though it gets a bit weird, as other nodes will still be able to discover this one (our service-discovery Service doesn't require readiness). So it would be possible for a node to be executing DistSQL queries (can someone fact-check me here?).

There is no fundamental reason that we need to use the same probe for 1 & 2 btw. OTOH, k8s may make certain assumptions that make it somewhat finicky not to.

Can you say more about this? AFAIK they have to be the same probe, and they should be the same probe from the k8s perspective. Do you mean that we could have a manual probe for the rolling upgrade?

My main question is whether or not this check adds a dependency on other nodes being live/ready. If I have an 8 node cluster and only 3 nodes are up, will all nodes become unready?

Our StatefulSets are set up to manage all pods in parallel, and nodes will be able to connect to unready nodes, so I don't think this will have any unintended consequences.

@joshimhoff (Collaborator) left a comment

Existing connections will not be dropped.

Very interesting!

return status.Errorf(codes.Unavailable, "node is not accepting SQL clients")
}

// Also check the availability of the SQL service.
@joshimhoff (Collaborator) left a comment

An alternative design to consider:

  1. Here we do a basic minimum check: `!s.server.sqlServer.acceptingClients.Get()`. If the probe fails, we halt the upgrade.
  2. In the CC / k8s operator upgrade automation, we check time-series metrics during the upgrade. If the metrics look bad, e.g. if they look different between the updated node & a node running the old version, we halt the upgrade.

Why check metrics? Well, the aggregations we can do in Prometheus may give a higher signal-to-noise ratio than making a local decision on the node itself. Also, maybe it's more extensible? When we get access to new & rich time-series data, it's easy to integrate into the upgrade process.

Lastly, it sidesteps the worries in my above comment, as (2) won't affect k8s load-balancing decisions by design.

@joshimhoff (Collaborator) commented

Also yay to this & thank you for getting to it so fast!

@tbg (Member) commented Jan 21, 2021

@joshimhoff brings up the good point that we don't rigorously define "health". I think there are three:

  • health: process is running well enough to respond (or some other very basic check)
  • health?ready=1: health and the node is ready to serve traffic (i.e.: you can put it in your lb)
  • health?cluster=1: from the point of view of this node, the cluster is healthy.

The third one ties in more clearly with Josh's kvprober side project, and the queries Raphael added make sense in that context as well.
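
A hypothetical sketch of that three-level split is below. Only `/health` and `/health?ready=1` exist today; the `cluster=1` variant is the proposal above, and the three check functions are placeholders assumed purely for illustration.

```go
// Placeholder checks, wired up elsewhere; the names are made up for this sketch.
var processHealthy, nodeReady, clusterHealthyFromHere func(ctx context.Context) error

func handleHealth(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	var err error
	switch {
	case q.Get("cluster") != "":
		// From the point of view of this node, is the cluster healthy?
		err = clusterHealthyFromHere(r.Context())
	case q.Get("ready") != "":
		// Is this node ready to serve traffic (safe to keep behind a load balancer)?
		err = nodeReady(r.Context())
	default:
		// Is the process running well enough to respond at all?
		err = processHealthy(r.Context())
	}
	if err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```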

@joshimhoff (Collaborator) commented

I like that frame, @tbg! I have mixed feelings about the idea of deciding if a cluster is healthy from a single node (it might make more sense to make such a decision from a control plane component), but so long as health?ready=1 doesn't make such a decision, I have no concerns about causing surprising production behavior in CC or on-prem k8s deployments.

@knz (Contributor, Author) commented Jan 21, 2021

Hm actually my PR is attempting to make another determination, separate from the 3 concepts Tobias presented already: whether the node is effectively connected to the rest of the cluster.

That is, it's not partitioned away.

@tbg (Member) commented Jan 22, 2021

Hm actually my PR is attempting to make another determination, separate from the 3 concepts Tobias presented already: whether the node is effectively connected to the rest of the cluster.

That is, it's not partitioned away.

Note that in health?cluster=1 I am saying "from the point of view of this node, the cluster is healthy". For a node to perceive the cluster as healthy, it needs to be connected to the cluster.

I have mixed feelings about the idea of deciding if a cluster is healthy from a single node

Right, and that too is acknowledged in the "from the point of view of this node" part. If you want the cluster to be healthy, on top of other checks, it should look healthy from all of the nodes, which the control plane could keep an eye on.

@knz (Contributor, Author) commented Jan 23, 2021

Ok I'm going to split this PR in two:

  • one that just adds a check for the SQL listener, which is the case that Andrew and Josh cared about initially. This will be more compact and carry less risk (and can be backported).
  • this one, where we're going to add the async checks discussed earlier in the review.

craig bot pushed a commit that referenced this pull request Jan 25, 2021
59146: bazel: generate `//go:generate stringer` files sandbox r=rickystewart a=alan-mas

First part of #57787 work.

As `gazelle` does not by itself take care of changes related to `//go:generate stringer`, we are creating a `genrule` to handle it.

Of all the `//go:generate stringer` files, there are some that have trouble during the bazel build:
```
-pkg/util/encoding/encoding.go
-pkg/util/encoding/BUILD.bazel
"stringer: can't handle non-integer constant type Type"

-pkg/workload/schemachange/schemachange.go
-pkg/workload/schemachange/BUILD.bazel
"stringer: can't happen: constant is not an integer TxStatusInFailure"
```

Release note: None

59327: build,bazel: inject nodejs support into our bazel setup r=rickystewart a=rickystewart

Use the standard Google-supported Bazel tools to do so. This isn't
plugged into anything yet, but you can verify that it works with `bazel
query @npm//...`.

Release note: None

59350: server: wait for SQL readiness in the `/health?ready=1` probe r=joshimhoff a=knz

(smaller version than #59191) 
Fixes #58864.
Planning to backport to v20.2.

Release note (api change): the health API now checks that the SQL
server is ready to accept clients when a readiness check is requested.

Co-authored-by: Alanmas <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@irfansharif irfansharif removed their request for review February 11, 2021 18:13

Successfully merging this pull request may close these issues.

server: need to change /health?ready=1 to returns success only after the node can serve SQL