CockroachDB: Health checks time out under load #44832

Closed
DuskEagle opened this issue Feb 6, 2020 · 54 comments
Labels
A-cc-enablement: Pertains to current CC production issues or short-term projects
A-kv-server: Relating to the KV-level RPC server
A-security
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-server-and-security: DB Server & Security
X-server-triaged-202105

Comments

@DuskEagle
Member

DuskEagle commented Feb 6, 2020

When a CockroachDB node is near 100% CPU usage, requests to any of the health check endpoints (/health, /health?ready=1 or /_admin/v1/health) will sometimes hang. In our example Kubernetes manifests, we have a timeout on our health checks of 1 second, but I have observed the endpoint fail to respond for 20+ seconds. The node is still otherwise up and able to process SQL queries.

This is a major problem when using an HTTP request to one of these endpoints as a Kubernetes liveness probe. When Kubernetes sees multiple liveness probe failures in a row, it assumes the CockroachDB container is unhealthy and restarts it. This results in downtime on a single-node cluster. On a three-node cluster, all three liveness probes can fail within a short period of time, also resulting in downtime.

Up until now, we have been using HTTP requests to the /health endpoint as our liveness probe. This is the behaviour specified within our example Kubernetes manifests.

To reproduce this behaviour, run cockroach workload init tpcc --warehouses=100 on a single-node GCP n1-standard-2 cluster (exact number of warehouses for a perfect repro TBD). On a Kubernetes cluster, our default liveness probe will fail, resulting in the CockroachDB container being restarted and the workload init command failing. If instead you change the liveness probe to a TCP check against either the HTTP or gRPC port, or remove the liveness probe entirely, the container is not restarted and the workload init command succeeds. Outside of a Kubernetes environment, you can also reproduce this issue by repeatedly sending HTTP requests to a health check endpoint and measuring the response time, as sketched below.
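
For reference, outside Kubernetes the latency can be observed with a simple polling loop like the one below while the workload runs (the localhost:8080 address is an assumption about the test cluster; add TLS flags as appropriate):

```sh
# Poll /health once per second and print how long each request takes.
# Adjust the URL (and add --cacert or -k) to match your deployment.
while true; do
  curl -s -o /dev/null -m 5 \
    -w "$(date +%T) /health -> HTTP %{http_code} in %{time_total}s\n" \
    http://localhost:8080/health
  sleep 1
done
```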

Ideally, what we would like from CockroachDB are two health check endpoints:

  • One for use as the liveness probe, which always responds healthy unless the application needs to be restarted.

  • One for use as the readiness probe, which responds healthy if and only if that CRDB pod is able to receive requests. We currently use /health?ready=1 as our readiness probe endpoint. It will also time out under load, but that is less severe than a liveness probe timing out: Kubernetes will stop routing traffic for the CockroachDB service to any pod whose readiness probe is failing, until the probe recovers.

In the meantime, we have removed the liveness probes from our Kubernetes deployments of CockroachDB, which has helped keep CockroachDB running under load (Kubernetes will still restart the CockroachDB container if CockroachDB actually crashes). We should consider recommending this for other customers deploying CockroachDB on top of Kubernetes.
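
For completeness, the TCP-check alternative mentioned in the reproduction steps amounts to a bare connect, which stays cheap even when the node is CPU-starved; a rough equivalent is shown below (port 26257 is the default SQL/gRPC port, and nc flags vary between netcat implementations):

```sh
# Equivalent of a tcpSocket probe against the gRPC/SQL port: succeeds as
# long as the process accepts connections, however overloaded it is.
nc -z -w 1 localhost 26257 && echo "live" || echo "not live"
```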

Epic: CRDB-549

Jira issue: CRDB-5200

@DuskEagle
Member Author

cc @andy-kimball and @nvanbenschoten.

@andy-kimball andy-kimball self-assigned this Feb 6, 2020
@andy-kimball
Contributor

@DuskEagle, would the liveness probe typically need to do anything but just always return "healthy"? In other words, are there other conditions we might want to check beyond the implicit condition that the process is healthy enough to trivially respond to an HTTP request? It's hard to think of cases where the process is still up, and yet thinks things are so dire that we should kill the entire container.

+@tbg, since he's been thinking about liveness in rolling upgrade cases, which seems somewhat related. I'd like us to have very crisp definitions for each term, like "healthy" or "ready" or "available", etc.

@DuskEagle
Member Author

@andy-kimball CockroachDB has the node liveness heartbeat which causes an instance to shut down if the heartbeat times out. This is already performing much of the same function as a liveness probe is intended to. If the node liveness heartbeat didn't exist, I would see the value in the liveness probe. I'm struggling to think of a case where the liveness probe is all that useful given the node liveness heartbeat. I think removing the liveness probe makes sense given that.

In terms of definitions for terms, here is how terms are defined in Kubernetes:

Liveness: The container is running and is "healthy". If this is false, Kubernetes will restart the container. The definition of "healthy" is application-dependent, but an unhealthy container is in a state where a restart is required.

Readiness: The container is able to serve requests. If this is false, Kubernetes will remove the pod from the endpoints of all services that match the pod. In other words, traffic directed at the load balancer for the application will not be routed to this pod.
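
For concreteness, an httpGet probe boils down to a bounded HTTP request whose status code decides pass or fail; roughly something like the following (the URL and the 1-second timeout mirror our example manifests but are assumptions here, and kubelet's real implementation differs in its details):

```sh
# Roughly what kubelet's httpGet probe does: a bounded request where a
# 2xx/3xx status passes and anything else (or a timeout) fails.
code=$(curl -s -o /dev/null -m 1 -w "%{http_code}" http://localhost:8080/health)
case "$code" in
  2??|3??) exit 0 ;;  # pass
  *)       exit 1 ;;  # fail: enough failures in a row triggers a restart
                      # (liveness) or removal from service endpoints (readiness)
esac
```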

@joshimhoff
Collaborator

joshimhoff commented Feb 7, 2020

CockroachDB has the node liveness heartbeat which causes an instance to shut down if the heartbeat times out. This is already performing much of the same function as a liveness probe is intended to. If the node liveness heartbeat didn't exist, I would see the value in the liveness probe. I'm struggling to think of a case where the liveness probe is all that useful given the node liveness heartbeat. I think removing the liveness probe makes sense given that.

This is an interesting point re: node liveness! I think I agree with the conclusion (that CRDB implements a kind of liveness probe and that this means a k8s level liveness probe doesn't add value), though it makes me realize I don't understand node liveness as well as I want to. To fully agree with the conclusion, I want to understand it better.

Can someone explain more concretely what it means that "node liveness heartbeat which causes an instance to shut down if the heartbeat times out"? My understanding is that if a node's liveness record is not updated (via a write to a system range) then the REST of the cluster considers it down / it loses leases / etc. I didn't know that a node would crash itself if it starts failing to update its liveness record. This makes sense to me, though also I find it a bit confusing. Couldn't a node fail to update its liveness record NOT because it's having issues but instead because OTHER nodes are having issues? I'd expect that to be the case, since IIUC writing the liveness record is a KV level write and thus requires quorum?

This detail (from cockroachdb/docs@a054377) also confuses me: "This is achieved using checks that ensure that each node connected to the cluster is updating its liveness record. This information is shared with the rest of the cluster using an internal gossip protocol." How does gossip come into play exactly?

@andreimatei
Contributor

@andy-kimball CockroachDB has the node liveness heartbeat which causes an instance to shut down if the heartbeat times out. This is already performing much of the same function as a liveness probe is intended to. If the node liveness heartbeat didn't exist, I would see the value in the liveness probe. I'm struggling to think of a case where the liveness probe is all that useful given the node liveness heartbeat. I think removing the liveness probe makes sense given that.

I think there's some confusion here. A CRDB node does not kill itself for reasons related to its node liveness record. Nodes will continue running happily if they fail to heartbeat their record.

I came here because someone else is complaining about this issue. Joel, could you get some data for one of these humongous 20s latencies? I guess a CPU profile while that is going on, and also a go runtime trace, perhaps some extra logs around here.
I'd find it very curious if we straight up are not able to schedule a goroutine to serve that trivial request for 20s. Particularly since this request is coming on the HTTP port, not on the RPC port. So I wonder if we just have some unintended hidden contention somewhere in this request's path.
We've recently added authentication for this request, which might add some arbitrary latency (#45020), but I don't think that's the cause here.
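
For whoever collects this: CockroachDB exposes the standard Go pprof handlers on its HTTP port, so something along these lines should capture the data while a slow /health request is in flight (host, port, and durations are assumptions; secure clusters may additionally require an authenticated admin session):

```sh
# 30-second CPU profile taken while the health endpoint is slow.
curl -sk "https://localhost:8080/debug/pprof/profile?seconds=30" -o cpu.pb.gz

# 10-second Go runtime execution trace from the same window.
curl -sk "https://localhost:8080/debug/pprof/trace?seconds=10" -o runtime.trace

# Inspect locally.
go tool pprof cpu.pb.gz
go tool trace runtime.trace
```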

@vladdy
Contributor

vladdy commented Feb 12, 2020

@andreimatei I've started looking into this to get better understanding of this behavior as well. I'll get the data you need.

@christianhuening

we're also having issues with that, please push this!

@knz
Contributor

knz commented Feb 14, 2020

I found the cause (see #45020) and will send a PR.

@knz knz added A-security C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-server Relating to the KV-level RPC server labels Feb 14, 2020
@joshimhoff
Collaborator

Awesome! We'd appreciate a backport, as this is causing regular prod issues on CC clusters.

@christianhuening

Hell yeah! We also need this fixed as soon as possible.

@knz
Contributor

knz commented Feb 14, 2020

understood.

@andreimatei
Contributor

Are you sure the authentication explains this? I think people are complaining about the /health endpoint, which did not do authentication, no?

@knz
Contributor

knz commented Feb 14, 2020

Are you sure the authentication explains this? I think people are complaining about the /health endpoint, which did not do authentication, no?

/_status/v1/details performs an authz check, which requires a KV lookup against system.users and/or system.role_members. Before #45119 is merged, /health points to /_status/v1/details, and that would explain the latency. With #45119 merged, /health becomes an alias for /_admin/v1/health, which does not require auth and is therefore lightweight.

@andreimatei
Contributor

Sorry, I'm confused. I think I verified in #45018 (comment) that /health does not currently perform authentication. It pretends to, but we take an early return path and run as root.

@knz
Contributor

knz commented Feb 14, 2020

I'm not talking about authentication. /health does not authenticate, as you've previously established, even though its target /_status/v1/details would otherwise.

The key point is that both /health and /_status/v1/details perform an authorization step. They retrieve the admin bit from system.users and system.role_members.

We want that for /_status/v1/details which is privileged, but we don't want it for /health. This is avoided by #45119.

@knz
Contributor

knz commented Feb 14, 2020

FWIW in #45119 I have the following:

  • /health without ?ready=1 returns OK as long as the process is running and the HTTP listener is active

  • /health?ready=1 will check liveness and health, and will return 503 in case of a problem, with the following details:

    • node is waiting for cluster initialization when currently waiting on cockroach init
    • node is shutting down if currently draining
    • node is not healthy if liveness not updated recently

Note that neither prior to this change nor with it does /health?ready=1 check the availability of ranges in KV. It relies on the liveness bit, which reflects that a recent heartbeat (including a KV write) by the node has succeeded.
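
From a probe's point of view, the difference described above looks roughly like this (scheme, host, and port are assumptions):

```sh
# Liveness-style check: 200 whenever the process is up and the HTTP
# listener is serving, regardless of cluster state.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health
# -> 200

# Readiness-style check: 200 only when the node is initialized, not
# draining, and has recently heartbeated its liveness; otherwise 503
# with a reason in the body.
curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:8080/health?ready=1"
# -> 200 or 503
```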

@knz
Contributor

knz commented Feb 14, 2020

@vladdy @DuskEagle after discussion with Andrei, we still find it useful to extract a CPU profile when the health endpoint seems to have a high latency. Please provide this to us regardless of progress made on #45119 - thanks.

@tbg
Member

tbg commented Feb 19, 2020

I'm a bit lost on what the latest on these timeout issues is (comments here indicate that it was solved, but we discussed it yesterday in KV weekly and it seemed mysterious still). A small recap would be nice. Either way, I just reviewed #45119 and it's how things ought to work. Don't restart the pod until /health fails (and this is just a noop request today), and determine whether to send client traffic to a node based on /health?ready=true.

There is no acceptable reason for /health to take 20+s unless perhaps the system is severely, severely overloaded. But then it should also not be possible to do anything else with the node in the meantime.

Going back a few messages to #44832 (comment) about the crisp definitions, they are problematic right now because we use the term "liveness" at different levels, all while basing it mostly on the NodeLiveness subsystem, which is too low-level in most cases but name-squats pretty effectively on the liveness concept. This is a bit unfortunate and we can change it over time, but as far as external communication goes, we should stick to the k8s verbiage in which a pod is "live" unless it's really horked up, but ready only when it should get served requests, which means (today) having a recent heartbeat on the liveness record and not being about to shut down (and in the future should include things like #44206). That's also the status quo after #45119, so externally we should be good (assuming the docs reflect that).

I commented on some of the internal confusion more concretely in

#45123 (comment).

@joshimhoff
Collaborator

There is no acceptable reason for /health to take 20+s unless perhaps the system is severely, severely overloaded. But then it should also not be possible to do anything else with the node in the meantime.

I think we certainly need to dig deeper into root cause, but on cloud, we are seeing liveness probe failures when a node is overloaded. Out of CPU, nearly out of memory. Often a customer is doing functional testing and is hitting CRDB hard on purpose as a stress test. Not sure if this is what you imagine when you say "severely, severely overloaded", but I guess Joel's repro instructions at the top of this issue can make concrete how overloaded CRDB is when liveness probes start failing.

I think it's a better experience for users if when CRDB is out of resources, serving requests starts taking more time but doesn't lead to total unavailability due to k8s killing the pods.

@knz
Contributor

knz commented Feb 19, 2020

I have discussed this with @DuskEagle yesterday. Here's the summary of our conversation:

  • what tobias said about the difference between /health and /health?ready=1 (this was already largely known)
  • agreement/acknowledgement that /health, being served by a Goroutine in a cooperatively scheduled system, can suffer from arbitrary delays when the system is overloaded. This is a "feature" of the Go runtime.
  • From @DuskEagle we need a probe that ascertains the process is still running, albeit overloaded. A probe that can suffer arbitrary delays just because the process is slow "doesn't match the requirement".
  • agreement/acknowledgement that asking the system being monitored to report on its liveness (not readiness!) is conceptually flawed

Conclusion: proposal to decide liveness based on another metric that can be sampled "outside" of the process without suffering from the vagaries of the go scheduler:

  • check that the process is still running (k8s does this)
  • check that it has performed disk I/O recently (an idle node still performs disk I/O for logging every 1-10 seconds IIRC)
  • perhaps other OS-level checks that we can determine together
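
As a sketch of what such an OS-level exec probe might look like (the log directory, the one-minute staleness window, and the process-name check are placeholders, not a tested recommendation):

```sh
#!/bin/sh
# Hypothetical exec-style liveness check: pass only if the cockroach
# process exists and something under its log directory was written to
# within the last minute.
LOG_DIR="${LOG_DIR:-/cockroach/cockroach-data/logs}"    # placeholder path

pgrep -x cockroach >/dev/null || exit 1                 # process running?
find "$LOG_DIR" -type f -mmin -1 | grep -q . || exit 1  # recent disk I/O?
exit 0
```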

@andreimatei
Contributor

I'm a bit lost on what the latest on these timeout issues is (comments here indicate that it was solved, but we discussed it yesterday in KV weekly and it seemed mysterious still). A small recap would be nice.

The recap is that we're at square one - no idea why this endpoint is not responding for "20+ seconds".

agreement/acknowledgement that /health, being served by a Goroutine in a cooperatively scheduled system, can suffer from arbitrary delays when the system is overloaded. This is a "feature" of the Go runtime.
Conclusion: proposal to decide liveness based on another metric that can be sampled "outside" of the process without suffering from the vagaries of the go scheduler:

Well, hold up one second :). Let's not give up quite that easily. Although in theory one can create arbitrary-length run queues, I don't think we've really seen that yet in CRDB, particularly not across requests coming in on different sockets. Case in point: when we moved the liveness heartbeats onto a different gRPC connection, they became a lot more reliable under heavy load - basically I don't think we can get them to take more than a few secs any more regardless of CPU load (different story for IO contention, but that's not the case for these probes).
These health checks are already on a connection of their own, so you'd think that would help them cut in line. Except maybe it's the fact that a new TLS connection is established every time that kills them. I've tried to see if Kubernetes can be configured to reuse a connection, and I couldn't see how. I think we should try removing TLS from the equation and seeing if that helps. If it doesn't, let's run experiments with reusing connections manually.
And generally let's at least look at some runtime traces before declaring defeat.
@DuskEagle and @joshimhoff , do you mind playing with this some more?

One thing we've seen is pathological behavior when GOMAXPROCS is not configured correctly. That's not the case for CC, is it?
The CC nodes we're talking about have at least a few CPUs, right? And there's no swapping?
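
One cheap way to see how much of each probe is TLS handshake versus serving the request is curl's timing breakdown (scheme, host, and port are assumptions; this is a rough observation, not a rigorous experiment):

```sh
# Fresh TLS connection per request, as kubelet does: time_appconnect is
# the handshake, time_total the whole request.
curl -sk -o /dev/null \
  -w "handshake %{time_appconnect}s  total %{time_total}s\n" \
  https://localhost:8080/health

# Two URLs in one invocation reuse the connection, so the second line
# approximates what a handshake-free probe would cost.
curl -sk -o /dev/null -o /dev/null \
  -w "handshake %{time_appconnect}s  total %{time_total}s\n" \
  https://localhost:8080/health https://localhost:8080/health
```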


craig bot pushed a commit that referenced this issue Mar 14, 2020
45119: server: fix broken /health r=tbg,andreimatei a=knz

Informs #44832.
Fixes #45020.

This patch fixes `/health` by ensuring it does not require
authentication, does not perform KV operations and does not expose
sensitive details.

Reminder, for posterity:

- `/_status/details` exposes health, readiness and node details,
  which is potentially privileged information. This requires an
  authenticated admin user. It also performs a KV operation for
  authentication and is thus subject to cluster availability.

- `/_admin/v1/health` only exposes health and readiness. It
  is non-privileged and does not require authentication. It does
  not perform KV operations and is thus not subject to cluster
  availability.

- `/health` is now an alias for `/_admin/v1/health`.  (It used to be a
  non-authenticated alias for `/_status/details` which was both a UX
  and security bug. See release notes below for details.)

- both `/health` and `/_admin/v1/health` accept a boolean flag, e.g.
  via `?ready=1`. When `ready` is specified, an HTTP error is returned
  if the node is up but unable to accept client connections, or not
  live, or shutting down. When `ready` is *not* specified, an HTTP
  success is always returned if the node is up, even when it cannot
  accept client connections or is not live.

Release justification: Category 3: Fixes for high-priority or high-severity bugs in existing functionality

Release note (security update): The non-authenticated `/health` HTTP
endpoint was previously exposing the private IP address of the node,
which can be privileged information in some deployments. This has been
corrected. Deployments using automation to retrieve a node's build
details, address details, etc. should use `/_status/details` instead,
with a valid admin authentication cookie.

Release note (bug fix): Accesses to `/health` using a non-root
authentication token do not hang any more
when a node is currently under load or if a system range is
unavailable.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@knz
Contributor

knz commented Apr 27, 2020

Discussed with @vladdy today:

  1. it's practically ok for CC to say "liveness = process liveness"
  2. however it would be better if we also offered a working liveness probe over HTTP
  3. it's OK if the liveness probe takes multiple seconds to respond under heavy CPU load - we should calibrate and then provide some guidance
  4. to minimize the impact of CPU usage we should change the liveness probe to be HTTP only (no TLS)

I am filing a separate issue #48069 to do this last point - ensure that our HTTP endpoint is able to process some API calls (those that are unauthenticated, including liveness) over HTTP without TLS. Once that is done, I'll close this issue.

@joshimhoff
Collaborator

@knz: I don't think it's important to close this issue, but admission control is almost surely the solution, since it can reserve enough CPU to always respond to HTTP health checks. I see this is linked from your lovely doc on server team strategy, so I thought I'd note that I don't think the priority of this issue is high. If you want more explanation, I am happy to elaborate.

@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@knz knz added the A-cc-enablement Pertains to current CC production issues or short-term projects label Jul 29, 2021
@knz
Contributor

knz commented Sep 17, 2021

The health and Prometheus vars endpoints no longer require authn or TLS.
Also, admission control has helped reduce CPU starvation.
I don't believe we have more work to do here.
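
Assuming a version that includes the change tracked in #48069 (serving the unauthenticated endpoints without TLS), a probe can then be plain HTTP; something like the following, though whether the HTTP port accepts non-TLS requests depends on version and configuration:

```sh
# Plain-HTTP checks against the unauthenticated endpoints; verify that
# your CockroachDB version actually serves these without TLS first.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/_status/vars
```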

@knz knz closed this as completed Sep 17, 2021
@rafiss
Collaborator

rafiss commented Sep 17, 2021

I believe as recently as last month I was involved with customer support cases where our recommendation was to disable the k8s liveness probe because of this issue.

If this issue is really resolved, then should we undo this change? #67080

cc @keith-mcclellan @taroface

@knz
Contributor

knz commented Sep 17, 2021

hm, can we get some backing on which version this recommendation was for?

the admission control work would only be in v21.2. Previous versions could still be affected.

@rafiss
Collaborator

rafiss commented Sep 17, 2021

Good point, I was definitely referring to an older version (20.2). It's a long thread, but details are in https://cockroachdb.zendesk.com/agent/tickets/8759

My concern is more about how we will tell people who upgrade to v21.2 that they should add back this liveness probe after we strongly recommended that they disable it. One specific task is to revert #67080, but it seems like this warrants more communication than just that.

@knz
Contributor

knz commented Sep 17, 2021

I'm not sure what would be a good way to go about this. I think what I'd like to propose is to run some load testing on CC clusters (for example using a workload generator) and see if it makes the CC probes trip up. If it doesn't, we can document this and indeed revert that linked PR. How does this sound?

@joshimhoff
Collaborator

joshimhoff commented Sep 17, 2021

@sumeerbhola are we confident that under arbitrary workload, on 21.2 cluster, we will reserve enough CPU to respond to health checks? My sense is that admission control is a WIP with respect to the above goal. If we are not confident that under arbitrary workload we will reserve enough CPU to respond to health checks, I would vote we leave this issue open & also not revert #67080.

@knz knz reopened this Sep 20, 2021
@sumeerbhola
Collaborator

@sumeerbhola are we confident that under arbitrary workload, on 21.2 cluster, we will reserve enough CPU to respond to health checks?

I read through only part of this long issue. Could you clarify what layers of the code are encountered when processing a health check request: is it only KV and lower, or does it include SQL too? And presumably it only needs to read local store state, yes?
Regarding confidence, I think it is highly likely, but it would need to be experimentally validated by reproducing a scenario which fails health checks with admission control turned off (which is the default for 21.2), and then trying with it turned on.
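
A rough sketch of that experiment, assuming the v21.2-era setting name admission.kv.enabled (verify against SHOW ALL CLUSTER SETTINGS on the version under test):

```sh
# Run the overload repro with admission control off, watch /health
# latency, then repeat with it on. The setting name is an assumption
# based on v21.2 and may differ between versions.
cockroach sql --insecure -e "SET CLUSTER SETTING admission.kv.enabled = false;"
cockroach workload init tpcc --warehouses=100   # overload repro from the issue description
# ... measure /health latency, e.g. with the curl loop shown earlier ...

cockroach sql --insecure -e "SET CLUSTER SETTING admission.kv.enabled = true;"
cockroach workload init tpcc --warehouses=100
```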

@joshimhoff
Collaborator

Regarding confidence, I think it is highly likely, but it would need to be experimentally validated by reproducing a scenario which fails health checks with admission control turned off (which is the default for 21.2), and then trying with it turned on.

That's very exciting!

@ajwerner
Contributor

ajwerner commented Mar 2, 2022

@joshimhoff is it time to close this?

@joshimhoff
Collaborator

Let's do it.

I'll note that in a CC context, we don't use a liveness probe today, and I am not eager to change that given the size of our eng backlog. The immediate benefit of doing so would be small, as I have seen very few issues that would be both caught by a liveness probe and mitigated by a process restart.

Relatedly, if someone wants to recommend a liveness probe for on-prem or similar deployments, I'd say someone should do some testing first. As we don't use one in CC, CC hasn't done that testing.
