CockroachDB: Health checks time out under load #44832
cc @andy-kimball and @nvanbenschoten.
@DuskEagle, would the liveness probe typically need to do anything other than always return "healthy"? In other words, are there other conditions we might want to check beyond the implicit condition that the process is healthy enough to trivially respond to an HTTP request? It's hard to think of cases where the process is still up, and yet thinks things are so dire that we should kill the entire container. +@tbg, since he's been thinking about liveness in rolling-upgrade cases, which seems somewhat related. I'd like us to have very crisp definitions for each term, like "healthy", "ready", "available", etc.
@andy-kimball CockroachDB has the node liveness heartbeat, which causes an instance to shut down if the heartbeat times out. This already performs much of the function a liveness probe is intended to. If the node liveness heartbeat didn't exist, I would see the value in the liveness probe; given that it does exist, I'm struggling to think of a case where the liveness probe is all that useful, so I think removing it makes sense.

In terms of definitions, here is how the terms are defined in Kubernetes:

- Liveness: the container is running and is "healthy". If this is false, Kubernetes will restart the container. The definition of "healthy" is application-dependent, but an unhealthy container is in a state where a restart is required.
- Readiness: the container is able to serve requests. If this is false, Kubernetes will remove the pod from the endpoints of all services that match the pod. In other words, traffic directed at the load balancer for the application will not be routed to this pod.
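To make the two definitions concrete, here is a minimal, generic Go sketch (not CockroachDB's actual handlers; the `/healthz` and `/readyz` paths are illustrative assumptions): the liveness handler succeeds whenever the process can respond at all, while the readiness handler fails until the application is actually able to take traffic.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// ready flips to true once the application has finished starting up
// and can accept client traffic.
var ready atomic.Bool

func main() {
	mux := http.NewServeMux()

	// Liveness: answering at all is the signal. If this handler cannot run,
	// the process is wedged and a restart is the appropriate remedy.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: report 503 until the application can serve requests, so the
	// orchestrator takes the pod out of service endpoints without restarting it.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Application startup would eventually call ready.Store(true).
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```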
This is an interesting point re: node liveness! I think I agree with the conclusion (that CRDB implements a kind of liveness probe, and that this means a k8s-level liveness probe doesn't add value), though it makes me realize I don't understand node liveness as well as I want to. To fully agree with the conclusion, I want to understand it better.

Can someone explain more concretely what it means that the "node liveness heartbeat ... causes an instance to shut down if the heartbeat times out"? My understanding is that if a node's liveness record is not updated (via a write to a system range), then the REST of the cluster considers it down, it loses leases, etc. I didn't know that a node would crash itself if it starts failing to update its liveness record. This makes sense to me, though I also find it a bit confusing. Couldn't a node fail to update its liveness record NOT because it's having issues but instead because OTHER nodes are having issues? I'd expect that to be the case, since IIUC writing the liveness record is a KV-level write and thus requires quorum?

This detail (from cockroachdb/docs@a054377) also confuses me: "This is achieved using checks that ensure that each node connected to the cluster is updating its liveness record. This information is shared with the rest of the cluster using an internal gossip protocol." How does gossip come into play exactly?
I think there's some confusion here. A CRDB node does not kill itself for reasons related to its node liveness record. Nodes will continue running happily if they fail to heartbeat their record.

I came here because someone else is complaining about this issue. Joel, could you get some data for one of these humongous 20s latencies? I guess a CPU profile while that is going on, and also a Go runtime trace, and perhaps some extra logs around here.
@andreimatei I've started looking into this to get a better understanding of this behavior as well. I'll get the data you need.
We're also having issues with this; please push it forward!
I found the cause (see #45020) and will send a PR.
Awesome! We'd appreciate a backport, as this is causing regular prod issues on CC clusters.
Hell yeah! We also need this fixed as soon as possible.
Understood.
Are you sure the authentication explains this? I think people are complaining about the […]
Sorry, I'm confused. I think I verified in #45018 (comment) that […]
I'm not talking about authentication. The key point is that both […]. We want that for […].
FWIW, in #45119 I have the following: […]
Note that neither prior to this change nor with it does […]
@vladdy @DuskEagle after discussion with Andrei, we still find it useful to extract a CPU profile when the health endpoint seems to have a high latency. Please provide this to us regardless of progress made on #45119 - thanks.
I'm a bit lost on what the latest on these timeout issues is (comments here indicate that it was solved, but we discussed it yesterday in KV weekly and it still seemed mysterious). A small recap would be nice.

Either way, I just reviewed #45119 and it's how things ought to work. Don't restart the pod until […]. There is no acceptable reason for […].

Going back a few messages to #44832 (comment) about the crisp definitions: they are problematic right now because we use the term "liveness" at different levels, all while basing it mostly on the […]. I commented on some of the internal confusion more concretely in […].
I think we certainly need to dig deeper into the root cause, but on cloud we are seeing liveness probe failures when a node is overloaded: out of CPU, nearly out of memory. Often a customer is doing functional testing and is hitting CRDB hard on purpose as a stress test. Not sure if this is what you imagine when you say "severely, severely overloaded", but I guess Joel's repro instructions at the top of this issue make concrete how overloaded CRDB is when liveness probes start failing. I think it's a better experience for users if, when CRDB is out of resources, serving requests starts taking more time but doesn't lead to total unavailability due to k8s killing the pods.
I discussed this with @DuskEagle yesterday. Here's the summary of our conversation: […]

Conclusion: a proposal to decide liveness based on another metric, one that can be sampled "outside" of the process without suffering from the vagaries of the Go scheduler: […]
The recap is that we're at square one - no idea why this endpoint is not responding for "20+ seconds".
Well, hold up one second :). Let's not give up quite that easily. Although in theory one can create arbitrary-length run queues, I don't think we've really seen that yet in CRDB, particularly not across requests coming in on different sockets. Case in point: when we moved the liveness heartbeats onto a different gRPC connection, they became a lot more reliable under heavy load; basically, I don't think we can get them to take more than a few seconds any more regardless of CPU load (different story for IO contention, but that's not the case for these probes).

A thing that we have seen is pathological behavior when GOMAXPROCS is not configured correctly. That's not the case for CC, is it?
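On the GOMAXPROCS point, here is a minimal sketch (not CRDB tooling) of the kind of sanity check involved: compare what the Go runtime will schedule against with the CPUs visible to the process, which inside a CPU-limited container are typically the host's CPUs rather than the container's quota.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it.
	// NumCPU reports the CPUs visible to the process, which in a container
	// is usually the host's CPU count, not the container's CPU limit.
	fmt.Printf("GOMAXPROCS=%d NumCPU=%d\n", runtime.GOMAXPROCS(0), runtime.NumCPU())

	// If GOMAXPROCS far exceeds the CPU quota the container is actually
	// granted, run queues grow under load; setting GOMAXPROCS (e.g. via the
	// environment variable, or a library such as go.uber.org/automaxprocs)
	// to roughly match the limit avoids that pathology.
}
```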
45119: server: fix broken /health r=tbg,andreimatei a=knz

Informs #44832. Fixes #45020.

This patch fixes `/health` by ensuring it does not require authentication, does not perform KV operations, and does not expose sensitive details.

Reminder, for posterity:

- `/_status/details` exposes health, readiness and node details, which is potentially privileged information. It requires an authenticated admin user. It also performs a KV operation for authentication and is thus subject to cluster availability.
- `/_admin/v1/health` only exposes health and readiness. It is non-privileged and does not require authentication. It does not perform KV operations and is thus not subject to cluster availability.
- `/health` is now an alias for `/_admin/v1/health`. (It used to be a non-authenticated alias for `/_status/details`, which was both a UX and a security bug. See the release notes below for details.)
- Both `/health` and `/_admin/v1/health` accept a boolean flag, e.g. via `?ready=1`. When `ready` is specified, an HTTP error is returned if the node is up but not able to accept client connections, or not live, or shutting down. When `ready` is *not* specified, an HTTP success is always returned if the node is up, even when it cannot accept client connections or is not live.

Release justification: Category 3: Fixes for high-priority or high-severity bugs in existing functionality.

Release note (security update): The non-authenticated `/health` HTTP endpoint was previously exposing the private IP address of the node, which can be privileged information in some deployments. This has been corrected. Deployments using automation to retrieve a node's build details, address details, etc. should use `/_status/details` instead, with a valid admin authentication cookie.

Release note (bug fix): Accesses to `/health` using a non-root authentication token no longer hang when a node is under load or when a system range is unavailable.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
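As a small usage sketch of the semantics described above (localhost, the default HTTP port 8080, and an insecure `http://` scheme are assumptions; adjust for a real deployment), the two calls below exercise the liveness-style and readiness-style behavior of the endpoint:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// check issues a GET and reports the HTTP status; any 2xx status means "pass".
func check(client *http.Client, url string) {
	resp, err := client.Get(url)
	if err != nil {
		fmt.Printf("%s -> error: %v\n", url, err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("%s -> %s\n", url, resp.Status)
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}

	// Without ?ready=1: success whenever the process is up, even if it
	// cannot accept client connections yet (liveness-style check).
	check(client, "http://localhost:8080/health")

	// With ?ready=1: an HTTP error when the node is up but not able to accept
	// client connections, not live, or shutting down (readiness-style check).
	check(client, "http://localhost:8080/health?ready=1")
}
```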
Discussed with @vladdy today: […]

I am filing a separate issue, #48069, to do this last point: ensure that our HTTP endpoint is able to process some API calls (those that are unauthenticated, including liveness) over HTTP without TLS. Once that is done, I'll close this issue.
@knz: I don't think this issue is important to close; also, admission control is almost surely the solution, since enough CPU can then be reserved to always respond to HTTP health checks. I see this is linked from your lovely doc on server team strategy, so I thought I'd note that I don't think the priority of this issue is high. If you want more explanation, I'm happy to elaborate.
The health and Prometheus vars don't need authn or TLS now.
I believe as recently as last month I was involved in customer support cases where our recommendation was to disable the k8s liveness probe because of this issue. If this issue is really resolved, should we undo this change? #67080
Hm, can we get some backing for which version that recommendation was for? The admission control work is only in v21.2; previous versions could still be affected.
Good point, I was definitely referring to an older version (20.2). It's a long thread, but the details are in https://cockroachdb.zendesk.com/agent/tickets/8759. My concern is more about how we will tell people who upgrade to v21.2 that they should add this liveness probe back after we strongly recommended that they disable it. One specific task is to revert #67080, but it seems to warrant more communication than just that.
I'm not sure what would be a good way to go about this. What I'd like to propose is to run some load testing on CC clusters (for example using a workload generator) and see if it makes the CC probes trip up. If it doesn't, we can document this and indeed revert that linked PR. How does this sound?
@sumeerbhola are we confident that under arbitrary workload, on […]
I read through only part of this long issue. Could you clarify what layers of the code are encountered when processing a health check request: is it only KV and lower, or does it include SQL too? And presumably it only needs to read local store state, yes?
That's very exciting!
@joshimhoff is it time to close this?
Let's do it. I'll note that in a CC context, we don't use a liveness probe today, and I am not eager to change that given the size of our eng backlog. The immediate benefit of doing so would be small, as I have seen very few issues that would be both caught by a liveness probe and mitigated by a process restart. Relatedly, if someone wants to recommend a liveness probe for on-prem or similar deployments, I'd say someone should do some testing first. As we don't use one in CC, CC hasn't done that testing.
When a CockroachDB node is near 100% CPU usage, requests to any of the health check endpoints (`/health`, `/health?ready=1` or `/_admin/v1/health`) will sometimes hang. In our example Kubernetes manifests, we have a timeout on our health checks of 1 second, but I have observed the endpoint fail to respond for 20+ seconds. The node is still otherwise up and able to process SQL queries.

This is a major problem when using an HTTP request to one of these endpoints as a Kubernetes liveness probe. When Kubernetes detects multiple liveness probes failing in a row, it believes that the CockroachDB container has crashed and will restart the container. This results in downtime on a single-node cluster. On a three-node cluster, all three liveness probes can fail within a short period of time, also resulting in downtime.

Up until now, we have been using HTTP requests to the `/health` endpoint as our liveness probe. This is the behaviour specified within our example Kubernetes manifests.

To reproduce this behaviour, run `cockroach workload init tpcc --warehouses=100` on a single-node GCP n1-standard-2 cluster (the exact number of warehouses for a perfect repro is TBD). On a Kubernetes cluster, our default liveness probe will fail, resulting in the CockroachDB container being restarted and the `workload init` command failing. If instead you change the liveness probe to a TCP check against either the HTTP or gRPC port, or remove the liveness probe entirely, the liveness probe will not fail and the `workload init` command will succeed. Outside of a Kubernetes environment, you can also reproduce this issue by repeatedly sending HTTP requests to a health check endpoint and measuring the response time (a small polling sketch is included after this description).

Ideally, what we would like from CockroachDB are two health check endpoints:

- One for use as the liveness probe, which always responds healthy unless the application needs to be restarted.
- One for use as the readiness probe, which responds healthy if and only if that CRDB pod is able to receive requests. We currently use `/health?ready=1` as our readiness probe endpoint. It will also time out under load, but that is less severe than a liveness probe timing out. Kubernetes will stop sending traffic directed to the CockroachDB service to any pod whose readiness probe is failing until the readiness probe recovers.

In the meantime, we have removed the liveness probes from our Kubernetes deployments of CockroachDB, which has helped keep CockroachDB running under load (Kubernetes will still restart the CockroachDB container if CockroachDB actually crashes). We should consider recommending this for other customers deploying CockroachDB on top of Kubernetes.
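For the non-Kubernetes reproduction mentioned above, a minimal sketch along these lines (the URL, the insecure `http://` scheme, and the 1-second threshold mirroring the manifests' probe timeout are assumptions) polls `/health` once per second and logs any response slower than the probe would tolerate:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	const url = "http://localhost:8080/health" // adjust host/port/scheme as needed
	client := &http.Client{Timeout: 30 * time.Second}

	for range time.Tick(time.Second) {
		start := time.Now()
		resp, err := client.Get(url)
		elapsed := time.Since(start)
		if err != nil {
			log.Printf("health check error after %v: %v", elapsed, err)
			continue
		}
		resp.Body.Close()
		// A Kubernetes probe with timeoutSeconds=1 would already have failed here.
		if elapsed > time.Second {
			log.Printf("health check took %v (status %s)", elapsed, resp.Status)
		}
	}
}
```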
Epic: CRDB-549
Jira issue: CRDB-5200