Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: fix broken /health #45119

Merged
merged 1 commit into from
Mar 14, 2020
Merged

server: fix broken /health #45119

merged 1 commit into from
Mar 14, 2020

Conversation

knz
Copy link
Contributor

@knz knz commented Feb 14, 2020

Informs #44832.
Fixes #45020.

This patch fixes /health by ensuring it does not require
authentication, does not perform KV operations and does not expose
sensitive details.

Reminder, for posterity:

  • /_status/details exposes health, readiness and node details,
    which is potentially privileged information. This requires an
    authenticated admin user. It also performs a KV operation for
    authentication and thus subject to cluster availability.

  • /_admin/v1/health only exposes health and readiness. It
    is non-privileged and does not require authentication. It does
    not perform KV operations and is thus not subject to cluster
    availability.

  • /health is now an alias for /_admin/v1/health. (It used to be a
    non-authenticated alias for /_status/details which was both a UX
    and security bug. See release notes below for details.)

  • both /health and /_admin/v1/health accept a boolean flag e.g.
    via ?ready=1. When ready is specified, a HTTP error is returned if
    the node is up but not able to accept client connections, or not
    live, or shutting down. When ready is not specified, a HTTP success is always
    returned if the node is up, even when it cannot accept client
    connections or is not live.

Release justification: Category 3: Fixes for high-priority or high-severity bugs in existing functionality

Release note (security update): The non-authenticated /health HTTP
endpoint was previously exposing the private IP address of the node,
which can be privileged information in some deployments. This has been
corrected. Deployments using automation to retrieve a node build
details, address details etc should use /_status/details instead
and use a valid admin authentication cookie.

Release note (bug fix): Accesses to /health using a non-root
authentication token do not hang any more
when a node is currently under load or if a system range is
unavailable.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz force-pushed the 20200214-health branch 2 times, most recently from 1d10bb9 to 2ea2a6a Compare February 14, 2020 18:43
@knz knz force-pushed the 20200214-health branch 2 times, most recently from 366026a to d25028a Compare February 14, 2020 21:49
@knz knz requested a review from a team as a code owner February 15, 2020 01:19
@knz
Copy link
Contributor Author

knz commented Feb 19, 2020

@andreimatei FYI @DuskEagle and I chatted about the CC issue today and we agreed that will need further investigation, but also possibly using a different probe for liveness (not readiness) than a HTTP request.

Meanwhile I think this current PR is still beneficial for separate reasons (stated in the commit message). so if you could still give it a review that would be swell.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. It had bugged me that we were abusing that other endpoint for /health. Thank you.


// TODO(knz): we'd like to use nodeLiveness.IsLive() here
// but we can't, see https://github.com/cockroachdb/cockroach/issues/45123
l, err := s.server.nodeLiveness.GetLiveness(s.server.NodeID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #45123, I think you want to use storage.LivenessStatus(.) (and no this should not just sit around in storage like that at all, but there it is).

}
return &serverpb.HealthResponse{}, nil
if l.Draining {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the problem with just using server.grpc.mode.get()? This is set synchronously with SetDraining() so I doubt it can really trail liveness (we might have to flip the order, but still). And even if it does trail, it will be by essentially nothing, not on any timescale that matters.

// ready specifies whether the client wants to know whether the
// target node is ready to receive traffic. If a node is unready, an
// error will be returned.
bool ready = 1;
}

// HealthResponse is the response to HealthRequest. It currently does not
// contain any information. The request fails however if the node is not live.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment

// error will be returned.
//
// Note: this is a more general version of the ready flag on
// HealthRequest.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just remove this? Users were always hitting /health directly (or at least I don't think we ever documented anything but), and those requests will go elsewhere now. If we're comfortable removing this I think we would want to.

@@ -424,11 +424,14 @@ message SettingsResponse {

// HealthRequest inquires whether the addressed node is healthy.
message HealthRequest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be specific about what this means since there was all of this ambiguity over in #44832. Something like:

HealthRequest requests a liveness or readiness check.

A liveness check is triggered via ready set to false. In this mode, an empty response is returned immediately, that is, the caller merely learns that the process is running.

A readiness check (ready == true) is suitable for determining whether user traffic should be directed at a given node, for example by a load balancer. In this mode, a successful response is returned only if the node

  • is not in the process of shutting down or booting up (including waiting for cluster bootstrap);
  • is regarded as healthy by the cluster via the recent broadcast of a liveness beacon.
    Absent either of these conditions, an error code will result.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS if you know of a better place to describe this, I'm up for that too. Definitely also something we may want to carry over to docs once we've worked out the verbiage.

@joshimhoff
Copy link
Collaborator

@andreimatei FYI @DuskEagle and I chatted about the CC issue today and we agreed that will need further investigation, but also possibly using a different probe for liveness (not readiness) than a HTTP request.

Can you update #44832, especially re: the idea that it might be better to NOT use an HTTP probe for liveness? Thank you.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd remove the second release note (or qualify it to refer to people who were using authentication, but I don't think that's anybody)

Is the "does not perform KV operations" part of the commit message accurate? Were there any "KV operations" after all?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @joshimhoff, @knz, and @tbg)


pkg/server/admin.go, line 1250 at r1 (raw file):

// Note: Health is non-privileged and non-authenticated and thus
// must not report privileged information.
// This is a simpler (non-privileged) version of Details().

I wouldn't reference Details(). The protos are different... There's little connection between the two.


pkg/server/admin.go, line 1284 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

See #45123, I think you want to use storage.LivenessStatus(.) (and no this should not just sit around in storage like that at all, but there it is).

I for one prefer staying away from LivenessStatus().
But I don't quite agree with the TODO. I'd use LivenessStatus() and check the Draining flag separately.

@joshimhoff joshimhoff removed their request for review March 12, 2020 16:17
This patch fixes `/health` by ensuring it does not require
authentication, does not perform KV operations and does not expose
sensitive details.

Reminder, for posterity:

- `/_status/details` exposes health, readiness and node details,
  which is potentially privileged information. This requires an
  authenticated admin user. It also performs a KV operation for
  authentication and thus subject to cluster availability.

- `/_admin/v1/health` only exposes health and readiness. It
  is non-privileged and does not require authentication. It does
  not perform KV operations and is thus not subject to cluster
  availability.

- `/health` is now an alias for `/_admin/v1/health`.  (It used to be a
  non-authenticated alias for `/_status/details/local` which was both a
  UX and security bug. See release notes below for details.)

- both `/health` and `/_admin/v1/health` accept a boolean flag e.g.
  via `?ready=1`. When `ready` is specified, a HTTP error is returned if
  the node is up but not able to accept client connections, or not live,
  or shutting down. When `ready` is *not* specified, a HTTP success is
  always returned if the node is up, even when it cannot accept client
  connections or is not live.

Release justification: Category 3: Fixes for high-priority or high-severity bugs in existing functionality

Release note (security update): The non-authenticated `/health` HTTP
endpoint was previously exposing the private IP address of the node,
which can be privileged information in some deployments. This has been
corrected. Deployments using automation to retrieve a node build
details, address details etc should use `/_status/details/local`
instead and use a valid admin authentication cookie.

Release note (bug fix): Accesses to `/health` using a non-root
authentication token do not hang any more when a node is currently
under load or if a system range is unavailable.
@knz knz force-pushed the 20200214-health branch from a8ee3bf to 125a058 Compare March 14, 2020 19:27
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qualify it to refer to people who were using authentication, but I don't think that's anybody

Done.

Is the "does not perform KV operations" part of the commit message accurate? Were there any "KV operations" after all?

Yes if the user was using a non-root account to access the endpoint.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


pkg/server/admin.go, line 1250 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I wouldn't reference Details(). The protos are different... There's little connection between the two.

Done.


pkg/server/admin.go, line 1284 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I for one prefer staying away from LivenessStatus().
But I don't quite agree with the TODO. I'd use LivenessStatus() and check the Draining flag separately.

Updated the TODO to refer to issue #45123, without specifics about what needs to happen.


pkg/server/admin.go, line 1291 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

What is the problem with just using server.grpc.mode.get()? This is set synchronously with SetDraining() so I doubt it can really trail liveness (we might have to flip the order, but still). And even if it does trail, it will be by essentially nothing, not on any timescale that matters.

Clarified in comment. There's a protocol difference between draining clients and draining leases.


pkg/server/serverpb/admin.proto, line 426 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

PS if you know of a better place to describe this, I'm up for that too. Definitely also something we may want to carry over to docs once we've worked out the verbiage.

Done.


pkg/server/serverpb/admin.proto, line 434 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Stale comment

Done.


pkg/server/serverpb/status.proto, line 91 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can't we just remove this? Users were always hitting /health directly (or at least I don't think we ever documented anything but), and those requests will go elsewhere now. If we're comfortable removing this I think we would want to.

Done.

@knz
Copy link
Contributor Author

knz commented Mar 14, 2020

bors r=tbg,andreimatei

@craig
Copy link
Contributor

craig bot commented Mar 14, 2020

Build succeeded

@craig craig bot merged commit 33d7147 into cockroachdb:master Mar 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: /health endpoint should not do authenticatication
5 participants