server: fix broken /health #45119

knz · 2020-02-14T16:50:09Z

This patch fixes /health by ensuring it does not require
authentication, does not perform KV operations and does not expose
sensitive details.

Reminder, for posterity:

/_status/details exposes health, readiness and node details,
which is potentially privileged information. This requires an
authenticated admin user. It also performs a KV operation for
authentication and thus subject to cluster availability.
/_admin/v1/health only exposes health and readiness. It
is non-privileged and does not require authentication. It does
not perform KV operations and is thus not subject to cluster
availability.
/health is now an alias for /_admin/v1/health. (It used to be a
non-authenticated alias for /_status/details which was both a UX
and security bug. See release notes below for details.)
both /health and /_admin/v1/health accept a boolean flag e.g.
via ?ready=1. When ready is specified, a HTTP error is returned if
the node is up but not able to accept client connections, or not
live, or shutting down. When ready is not specified, a HTTP success is always
returned if the node is up, even when it cannot accept client
connections or is not live.

Release justification: Category 3: Fixes for high-priority or high-severity bugs in existing functionality

Release note (security update): The non-authenticated /health HTTP
endpoint was previously exposing the private IP address of the node,
which can be privileged information in some deployments. This has been
corrected. Deployments using automation to retrieve a node build
details, address details etc should use /_status/details instead
and use a valid admin authentication cookie.

Release note (bug fix): Accesses to /health using a non-root
authentication token do not hang any more
when a node is currently under load or if a system range is
unavailable.

cockroach-teamcity · 2020-02-14T16:50:19Z

This change is

knz · 2020-02-19T02:35:42Z

@andreimatei FYI @DuskEagle and I chatted about the CC issue today and we agreed that will need further investigation, but also possibly using a different probe for liveness (not readiness) than a HTTP request.

Meanwhile I think this current PR is still beneficial for separate reasons (stated in the commit message). so if you could still give it a review that would be swell.

tbg

Nice. It had bugged me that we were abusing that other endpoint for /health. Thank you.

tbg · 2020-02-19T13:52:44Z

pkg/server/admin.go

+
+	// TODO(knz): we'd like to use nodeLiveness.IsLive() here
+	// but we can't, see https://github.com/cockroachdb/cockroach/issues/45123
+	l, err := s.server.nodeLiveness.GetLiveness(s.server.NodeID())


See #45123, I think you want to use storage.LivenessStatus(.) (and no this should not just sit around in storage like that at all, but there it is).

tbg · 2020-02-19T13:55:48Z

pkg/server/admin.go

 	}
-	return &serverpb.HealthResponse{}, nil
+	if l.Draining {


What is the problem with just using server.grpc.mode.get()? This is set synchronously with SetDraining() so I doubt it can really trail liveness (we might have to flip the order, but still). And even if it does trail, it will be by essentially nothing, not on any timescale that matters.

tbg · 2020-02-19T14:02:54Z

pkg/server/serverpb/admin.proto

+  // ready specifies whether the client wants to know whether the
+  // target node is ready to receive traffic. If a node is unready, an
+  // error will be returned.
+  bool ready = 1;
 }

 // HealthResponse is the response to HealthRequest. It currently does not
 // contain any information. The request fails however if the node is not live.


Stale comment

tbg · 2020-02-19T14:07:04Z

pkg/server/serverpb/status.proto

+  // error will be returned.
+  //
+  // Note: this is a more general version of the ready flag on
+  // HealthRequest.


Can't we just remove this? Users were always hitting /health directly (or at least I don't think we ever documented anything but), and those requests will go elsewhere now. If we're comfortable removing this I think we would want to.

tbg · 2020-02-19T14:19:39Z

pkg/server/serverpb/admin.proto

@@ -424,11 +424,14 @@ message SettingsResponse {

 // HealthRequest inquires whether the addressed node is healthy.
 message HealthRequest {


Let's be specific about what this means since there was all of this ambiguity over in #44832. Something like:

HealthRequest requests a liveness or readiness check.

A liveness check is triggered via ready set to false. In this mode, an empty response is returned immediately, that is, the caller merely learns that the process is running.

A readiness check (ready == true) is suitable for determining whether user traffic should be directed at a given node, for example by a load balancer. In this mode, a successful response is returned only if the node

is not in the process of shutting down or booting up (including waiting for cluster bootstrap);

is regarded as healthy by the cluster via the recent broadcast of a liveness beacon.
Absent either of these conditions, an error code will result.

PS if you know of a better place to describe this, I'm up for that too. Definitely also something we may want to carry over to docs once we've worked out the verbiage.

joshimhoff · 2020-02-19T14:22:52Z

@andreimatei FYI @DuskEagle and I chatted about the CC issue today and we agreed that will need further investigation, but also possibly using a different probe for liveness (not readiness) than a HTTP request.

Can you update #44832, especially re: the idea that it might be better to NOT use an HTTP probe for liveness? Thank you.

andreimatei

nit: I'd remove the second release note (or qualify it to refer to people who were using authentication, but I don't think that's anybody)

Is the "does not perform KV operations" part of the commit message accurate? Were there any "KV operations" after all?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @joshimhoff, @knz, and @tbg)

pkg/server/admin.go, line 1250 at r1 (raw file):

// Note: Health is non-privileged and non-authenticated and thus
// must not report privileged information.
// This is a simpler (non-privileged) version of Details().

I wouldn't reference Details(). The protos are different... There's little connection between the two.

pkg/server/admin.go, line 1284 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

See #45123, I think you want to use storage.LivenessStatus(.) (and no this should not just sit around in storage like that at all, but there it is).

I for one prefer staying away from LivenessStatus().
But I don't quite agree with the TODO. I'd use LivenessStatus() and check the Draining flag separately.

This patch fixes `/health` by ensuring it does not require authentication, does not perform KV operations and does not expose sensitive details. Reminder, for posterity: - `/_status/details` exposes health, readiness and node details, which is potentially privileged information. This requires an authenticated admin user. It also performs a KV operation for authentication and thus subject to cluster availability. - `/_admin/v1/health` only exposes health and readiness. It is non-privileged and does not require authentication. It does not perform KV operations and is thus not subject to cluster availability. - `/health` is now an alias for `/_admin/v1/health`. (It used to be a non-authenticated alias for `/_status/details/local` which was both a UX and security bug. See release notes below for details.) - both `/health` and `/_admin/v1/health` accept a boolean flag e.g. via `?ready=1`. When `ready` is specified, a HTTP error is returned if the node is up but not able to accept client connections, or not live, or shutting down. When `ready` is *not* specified, a HTTP success is always returned if the node is up, even when it cannot accept client connections or is not live. Release justification: Category 3: Fixes for high-priority or high-severity bugs in existing functionality Release note (security update): The non-authenticated `/health` HTTP endpoint was previously exposing the private IP address of the node, which can be privileged information in some deployments. This has been corrected. Deployments using automation to retrieve a node build details, address details etc should use `/_status/details/local` instead and use a valid admin authentication cookie. Release note (bug fix): Accesses to `/health` using a non-root authentication token do not hang any more when a node is currently under load or if a system range is unavailable.

knz

qualify it to refer to people who were using authentication, but I don't think that's anybody

Done.

Is the "does not perform KV operations" part of the commit message accurate? Were there any "KV operations" after all?

Yes if the user was using a non-root account to access the endpoint.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)

pkg/server/admin.go, line 1250 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I wouldn't reference Details(). The protos are different... There's little connection between the two.

Done.

pkg/server/admin.go, line 1284 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I for one prefer staying away from LivenessStatus().
But I don't quite agree with the TODO. I'd use LivenessStatus() and check the Draining flag separately.

Updated the TODO to refer to issue #45123, without specifics about what needs to happen.

pkg/server/admin.go, line 1291 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

What is the problem with just using server.grpc.mode.get()? This is set synchronously with SetDraining() so I doubt it can really trail liveness (we might have to flip the order, but still). And even if it does trail, it will be by essentially nothing, not on any timescale that matters.

Clarified in comment. There's a protocol difference between draining clients and draining leases.

pkg/server/serverpb/admin.proto, line 426 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

PS if you know of a better place to describe this, I'm up for that too. Definitely also something we may want to carry over to docs once we've worked out the verbiage.

Done.

pkg/server/serverpb/admin.proto, line 434 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Stale comment

Done.

pkg/server/serverpb/status.proto, line 91 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can't we just remove this? Users were always hitting /health directly (or at least I don't think we ever documented anything but), and those requests will go elsewhere now. If we're comfortable removing this I think we would want to.

Done.

knz · 2020-03-14T20:34:02Z

bors r=tbg,andreimatei

craig · 2020-03-14T21:08:48Z

Build succeeded

GitHub CI (Cockroach)

knz requested review from andreimatei and joshimhoff February 14, 2020 16:50

knz force-pushed the 20200214-health branch 2 times, most recently from 1d10bb9 to 2ea2a6a Compare February 14, 2020 18:43

knz mentioned this pull request Feb 14, 2020

CockroachDB: Health checks time out under load #44832

Closed

knz force-pushed the 20200214-health branch 2 times, most recently from 366026a to d25028a Compare February 14, 2020 21:49

knz mentioned this pull request Feb 14, 2020

server: avoid considering non-authenticated reqs as root-authenticated #45125

Closed

knz force-pushed the 20200214-health branch from d25028a to 0cb85f2 Compare February 15, 2020 01:19

knz requested a review from a team as a code owner February 15, 2020 01:19

knz force-pushed the 20200214-health branch from 0cb85f2 to a8ee3bf Compare February 15, 2020 01:47

knz added backport-19.1.x labels Feb 15, 2020

tbg approved these changes Feb 19, 2020

View reviewed changes

andreimatei reviewed Feb 19, 2020

View reviewed changes

joshimhoff removed their request for review March 12, 2020 16:17

knz force-pushed the 20200214-health branch from a8ee3bf to 125a058 Compare March 14, 2020 19:27

knz commented Mar 14, 2020

View reviewed changes

craig bot merged commit 33d7147 into cockroachdb:master Mar 14, 2020

knz mentioned this pull request Mar 24, 2020

release-19.2: cockroach init and /health fixes #46477

Merged

jseldess mentioned this pull request Mar 24, 2020

server: fix broken /health cockroachdb/docs#6968

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: fix broken /health #45119

server: fix broken /health #45119

knz commented Feb 14, 2020 •

edited

Loading

cockroach-teamcity commented Feb 14, 2020

knz commented Feb 19, 2020

tbg left a comment

tbg Feb 19, 2020

tbg Feb 19, 2020

tbg Feb 19, 2020

tbg Feb 19, 2020

tbg Feb 19, 2020

tbg Feb 19, 2020

joshimhoff commented Feb 19, 2020

andreimatei left a comment

knz left a comment

knz commented Mar 14, 2020

craig bot commented Mar 14, 2020

		@@ -424,11 +424,14 @@ message SettingsResponse {

		// HealthRequest inquires whether the addressed node is healthy.
		message HealthRequest {

server: fix broken /health #45119

server: fix broken /health #45119

Conversation

knz commented Feb 14, 2020 • edited Loading

cockroach-teamcity commented Feb 14, 2020

knz commented Feb 19, 2020

tbg left a comment

Choose a reason for hiding this comment

tbg Feb 19, 2020

Choose a reason for hiding this comment

tbg Feb 19, 2020

Choose a reason for hiding this comment

tbg Feb 19, 2020

Choose a reason for hiding this comment

tbg Feb 19, 2020

Choose a reason for hiding this comment

tbg Feb 19, 2020

Choose a reason for hiding this comment

tbg Feb 19, 2020

Choose a reason for hiding this comment

joshimhoff commented Feb 19, 2020

andreimatei left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

knz commented Mar 14, 2020

craig bot commented Mar 14, 2020

Build succeeded

knz commented Feb 14, 2020 •

edited

Loading