-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: fix broken /health #45119
server: fix broken /health #45119
Conversation
1d10bb9
to
2ea2a6a
Compare
366026a
to
d25028a
Compare
@andreimatei FYI @DuskEagle and I chatted about the CC issue today and we agreed that will need further investigation, but also possibly using a different probe for liveness (not readiness) than a HTTP request. Meanwhile I think this current PR is still beneficial for separate reasons (stated in the commit message). so if you could still give it a review that would be swell. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. It had bugged me that we were abusing that other endpoint for /health
. Thank you.
|
||
// TODO(knz): we'd like to use nodeLiveness.IsLive() here | ||
// but we can't, see https://github.com/cockroachdb/cockroach/issues/45123 | ||
l, err := s.server.nodeLiveness.GetLiveness(s.server.NodeID()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #45123, I think you want to use storage.LivenessStatus(.)
(and no this should not just sit around in storage
like that at all, but there it is).
} | ||
return &serverpb.HealthResponse{}, nil | ||
if l.Draining { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the problem with just using server.grpc.mode.get()
? This is set synchronously with SetDraining()
so I doubt it can really trail liveness (we might have to flip the order, but still). And even if it does trail, it will be by essentially nothing, not on any timescale that matters.
pkg/server/serverpb/admin.proto
Outdated
// ready specifies whether the client wants to know whether the | ||
// target node is ready to receive traffic. If a node is unready, an | ||
// error will be returned. | ||
bool ready = 1; | ||
} | ||
|
||
// HealthResponse is the response to HealthRequest. It currently does not | ||
// contain any information. The request fails however if the node is not live. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stale comment
pkg/server/serverpb/status.proto
Outdated
// error will be returned. | ||
// | ||
// Note: this is a more general version of the ready flag on | ||
// HealthRequest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just remove this? Users were always hitting /health
directly (or at least I don't think we ever documented anything but), and those requests will go elsewhere now. If we're comfortable removing this I think we would want to.
@@ -424,11 +424,14 @@ message SettingsResponse { | |||
|
|||
// HealthRequest inquires whether the addressed node is healthy. | |||
message HealthRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be specific about what this means since there was all of this ambiguity over in #44832. Something like:
HealthRequest requests a liveness or readiness check.
A liveness check is triggered via ready
set to false
. In this mode, an empty response is returned immediately, that is, the caller merely learns that the process is running.
A readiness check (ready == true
) is suitable for determining whether user traffic should be directed at a given node, for example by a load balancer. In this mode, a successful response is returned only if the node
- is not in the process of shutting down or booting up (including waiting for cluster bootstrap);
- is regarded as healthy by the cluster via the recent broadcast of a liveness beacon.
Absent either of these conditions, an error code will result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS if you know of a better place to describe this, I'm up for that too. Definitely also something we may want to carry over to docs once we've worked out the verbiage.
Can you update #44832, especially re: the idea that it might be better to NOT use an HTTP probe for liveness? Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd remove the second release note (or qualify it to refer to people who were using authentication, but I don't think that's anybody)
Is the "does not perform KV operations" part of the commit message accurate? Were there any "KV operations" after all?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @joshimhoff, @knz, and @tbg)
pkg/server/admin.go, line 1250 at r1 (raw file):
// Note: Health is non-privileged and non-authenticated and thus // must not report privileged information. // This is a simpler (non-privileged) version of Details().
I wouldn't reference Details(). The protos are different... There's little connection between the two.
pkg/server/admin.go, line 1284 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
See #45123, I think you want to use
storage.LivenessStatus(.)
(and no this should not just sit around instorage
like that at all, but there it is).
I for one prefer staying away from LivenessStatus()
.
But I don't quite agree with the TODO. I'd use LivenessStatus()
and check the Draining
flag separately.
This patch fixes `/health` by ensuring it does not require authentication, does not perform KV operations and does not expose sensitive details. Reminder, for posterity: - `/_status/details` exposes health, readiness and node details, which is potentially privileged information. This requires an authenticated admin user. It also performs a KV operation for authentication and thus subject to cluster availability. - `/_admin/v1/health` only exposes health and readiness. It is non-privileged and does not require authentication. It does not perform KV operations and is thus not subject to cluster availability. - `/health` is now an alias for `/_admin/v1/health`. (It used to be a non-authenticated alias for `/_status/details/local` which was both a UX and security bug. See release notes below for details.) - both `/health` and `/_admin/v1/health` accept a boolean flag e.g. via `?ready=1`. When `ready` is specified, a HTTP error is returned if the node is up but not able to accept client connections, or not live, or shutting down. When `ready` is *not* specified, a HTTP success is always returned if the node is up, even when it cannot accept client connections or is not live. Release justification: Category 3: Fixes for high-priority or high-severity bugs in existing functionality Release note (security update): The non-authenticated `/health` HTTP endpoint was previously exposing the private IP address of the node, which can be privileged information in some deployments. This has been corrected. Deployments using automation to retrieve a node build details, address details etc should use `/_status/details/local` instead and use a valid admin authentication cookie. Release note (bug fix): Accesses to `/health` using a non-root authentication token do not hang any more when a node is currently under load or if a system range is unavailable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qualify it to refer to people who were using authentication, but I don't think that's anybody
Done.
Is the "does not perform KV operations" part of the commit message accurate? Were there any "KV operations" after all?
Yes if the user was using a non-root account to access the endpoint.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)
pkg/server/admin.go, line 1250 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I wouldn't reference Details(). The protos are different... There's little connection between the two.
Done.
pkg/server/admin.go, line 1284 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I for one prefer staying away from
LivenessStatus()
.
But I don't quite agree with the TODO. I'd useLivenessStatus()
and check theDraining
flag separately.
Updated the TODO to refer to issue #45123, without specifics about what needs to happen.
pkg/server/admin.go, line 1291 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
What is the problem with just using
server.grpc.mode.get()
? This is set synchronously withSetDraining()
so I doubt it can really trail liveness (we might have to flip the order, but still). And even if it does trail, it will be by essentially nothing, not on any timescale that matters.
Clarified in comment. There's a protocol difference between draining clients and draining leases.
pkg/server/serverpb/admin.proto, line 426 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
PS if you know of a better place to describe this, I'm up for that too. Definitely also something we may want to carry over to docs once we've worked out the verbiage.
Done.
pkg/server/serverpb/admin.proto, line 434 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Stale comment
Done.
pkg/server/serverpb/status.proto, line 91 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can't we just remove this? Users were always hitting
/health
directly (or at least I don't think we ever documented anything but), and those requests will go elsewhere now. If we're comfortable removing this I think we would want to.
Done.
bors r=tbg,andreimatei |
Build succeeded |
Informs #44832.
Fixes #45020.
This patch fixes
/health
by ensuring it does not requireauthentication, does not perform KV operations and does not expose
sensitive details.
Reminder, for posterity:
/_status/details
exposes health, readiness and node details,which is potentially privileged information. This requires an
authenticated admin user. It also performs a KV operation for
authentication and thus subject to cluster availability.
/_admin/v1/health
only exposes health and readiness. Itis non-privileged and does not require authentication. It does
not perform KV operations and is thus not subject to cluster
availability.
/health
is now an alias for/_admin/v1/health
. (It used to be anon-authenticated alias for
/_status/details
which was both a UXand security bug. See release notes below for details.)
both
/health
and/_admin/v1/health
accept a boolean flag e.g.via
?ready=1
. Whenready
is specified, a HTTP error is returned ifthe node is up but not able to accept client connections, or not
live, or shutting down. When
ready
is not specified, a HTTP success is alwaysreturned if the node is up, even when it cannot accept client
connections or is not live.
Release justification: Category 3: Fixes for high-priority or high-severity bugs in existing functionality
Release note (security update): The non-authenticated
/health
HTTPendpoint was previously exposing the private IP address of the node,
which can be privileged information in some deployments. This has been
corrected. Deployments using automation to retrieve a node build
details, address details etc should use
/_status/details
insteadand use a valid admin authentication cookie.
Release note (bug fix): Accesses to
/health
using a non-rootauthentication token do not hang any more
when a node is currently under load or if a system range is
unavailable.