Stability problems with restore #15341
Comments
We've found very consistently in RESTORE testing that missing node liveness heartbeats can cause death spirals. There was a recent round of rate limiting added (5 in-flight ImportRequests at any given time) to try to combat this, on the assumption that it was caused by distsender/grpc contention. An additional data point is that a 10 node cluster of the same hardware seemed to be doing fine on node liveness during RESTORE. I'm going to rerun it to see if something in the code has changed to cause this or if the rate limiting isn't working as well on the 15 node cluster.
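This kind of in-flight request cap is typically implemented as a counting semaphore. A minimal Go sketch of the idea, using hypothetical names rather than the actual CockroachDB code:

```go
package main

import (
	"fmt"
	"sync"
)

// maxInFlightImports mirrors the "5 in-flight ImportRequests" limit
// mentioned above; the name and surrounding code are illustrative,
// not the actual CockroachDB implementation.
const maxInFlightImports = 5

// importSem is a buffered channel used as a counting semaphore: a send
// acquires a slot, a receive releases it.
var importSem = make(chan struct{}, maxInFlightImports)

func sendImportRequest(id int) {
	importSem <- struct{}{}        // blocks once maxInFlightImports requests are in flight
	defer func() { <-importSem }() // release the slot when done

	// ... issue the actual ImportRequest here ...
	fmt.Printf("import %d done\n", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sendImportRequest(i)
		}(i)
	}
	wg.Wait()
}
```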
This indeed seems to still be happening with the 10 node cluster, but to a lesser extent. There was a very strong correlation between slow liveness heartbeats, rocksdb batches taking more than 500ms to commit, and disk I/O as reported by GCE. It really does seem we're running out of disk bandwidth.
We should get more aggressive with retries in node liveness, in particular when we notice an ambiguous result error. But some of these timeouts are just insane, and no matter what we do, we're not going to eliminate the problem of node liveness failing (and death spirals starting) if we can't service the node liveness requests within a second or two. I'll send a PR for immediate retries on ambiguous result errors.
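The immediate-retry idea amounts to a loop like the following. This is an illustrative sketch under the assumption that ambiguous results are distinguishable from other heartbeat errors; `sendHeartbeat` and `errAmbiguousResult` are hypothetical stand-ins, not the real liveness code:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// errAmbiguousResult stands in for an ambiguous result error from the
// KV layer; sendHeartbeat stands in for the actual liveness heartbeat
// RPC. Both are hypothetical names for illustration.
var errAmbiguousResult = errors.New("ambiguous result")

func sendHeartbeat() error { return nil }

// heartbeatWithRetry retries immediately when the outcome of a heartbeat
// is ambiguous, rather than waiting for the next heartbeat interval and
// risking the liveness record expiring in the meantime.
func heartbeatWithRetry(maxAttempts int) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		start := time.Now()
		if err = sendHeartbeat(); err == nil {
			return nil
		}
		if !errors.Is(err, errAmbiguousResult) {
			return err // other errors fall back to the normal heartbeat loop
		}
		log.Printf("heartbeat was ambiguous after %s; retrying immediately", time.Since(start))
	}
	return err
}

func main() {
	if err := heartbeatWithRetry(3); err != nil {
		log.Printf("heartbeat failed: %v", err)
	}
}
```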
Allowing only 1 ImportRequest to run at a time recipient-side (it was 5) seems to have helped a lot. There have been very few slow heartbeats (one node did have a brief bad period) and fewer (but still some) rocksdb batches taking > 500 ms. The overall restore is running maybe 10-15% slower. I'll let this run overnight and double back on the 15 node cluster tomorrow. Given that we still have some batches taking more than 500ms, it might also be worth rate limiting the number of concurrent WriteBatch requests (note to self: this would have to be a different rate limit than Import to avoid deadlock).
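The deadlock concern comes from an Import needing to issue WriteBatches while it holds a limiter slot: if both request types shared one limiter, those WriteBatches could wait on slots the Imports never release. A hedged sketch of keeping the two limits independent (values and names are illustrative, not the shipped ones):

```go
package main

import "fmt"

// Separate counting semaphores for Import and WriteBatch. Sharing a
// single limiter would risk deadlock: an Import holding the only slot
// also issues WriteBatches, which would then wait forever for that same
// slot. Limits and names here are illustrative, not the shipped values.
var (
	importSem     = make(chan struct{}, 1) // recipient-side Import limit
	writeBatchSem = make(chan struct{}, 2) // independent WriteBatch limit
)

func processImport() {
	importSem <- struct{}{}
	defer func() { <-importSem }()

	// An Import fans out into several WriteBatches, each throttled by
	// its own semaphore rather than the Import one.
	for i := 0; i < 4; i++ {
		processWriteBatch(i)
	}
}

func processWriteBatch(i int) {
	writeBatchSem <- struct{}{}
	defer func() { <-writeBatchSem }()
	fmt.Printf("write batch %d applied\n", i)
}

func main() {
	processImport()
}
```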
Yikes, a 17-second rocksdb sync is bad news. There's not much we can do if the disk is doing that to us; it's going to be pretty disruptive. Looks like we may need to prioritize #7807 in the 1.1 timeframe. Which cluster was this? We have node_exporter stats in grafana that include disk I/O.
FWIW, we only allow 1 snapshot to be applied at a time. Do the ImportRequests perform the download? I think the important part to limit is the WriteBatchRequests.
A terraform cluster in our shared GCE project. I doubt the node exporter was running on it. We still should be able to access GCE's metrics for a while, if needed. |
Many of our RESTORE stability problems seem to be starting with overloaded disks, which caused contention in RocksDB, which slowed down heartbeats, which caused mass lease transfers. WriteBatch and the resulting write amplification is the largest part of this, so add a rate limit directly for that. The WriteBatch limiting also lets us be less conservative about the client-side Import request limiting in Restore. For cockroachdb#15341.
@danhhz is this issue still tracking any remaining work for 1.0?
Nope, I was just about to close this.
@danhhz ran a 15-node 2TB restore test on GCE today, and it was suffering from some major problems.
At the heart of these problems is that nodes are failing to heartbeat their node liveness records quickly enough, which leads to lost leases and performance problems. Judging by the logs on these nodes, there are two main causes of the slow heartbeats: very slow rocksdb commits, and ambiguous result errors.
All of the earliest "slow heartbeat" log messages on node 1 of the cluster are accompanied by a similarly slow rocksdb commit. The first error comes from a rocksdb commit that took 17.85 seconds to write a single key (`batch [1/51/5] commit took 17.853569878s`), and there are many more multi-second commits for very small batches in the logs (along with a number of multi-second commits of 1MB batches).

Our hypothesis at this point is that our disk I/O is being throttled by GCE. @danhhz is going to run another experiment to test that hypothesis. It looks feasible though, judging by our usage in the graphs below and the documented limits at https://cloud.google.com/compute/docs/disks/performance
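The "commit took" lines quoted above come from timing each batch commit and warning when it crosses a threshold. A rough sketch of that pattern, with the threshold, names, and log wording assumed rather than taken from the real code:

```go
package main

import (
	"log"
	"time"
)

// slowCommitThreshold mirrors the ~500ms threshold discussed in this
// thread; the value, names, and log wording are assumptions, not the
// exact CockroachDB code.
const slowCommitThreshold = 500 * time.Millisecond

// commitBatch stands in for the actual RocksDB batch commit.
func commitBatch(numKeys int) error {
	time.Sleep(10 * time.Millisecond) // simulate the write
	return nil
}

// timedCommit wraps a commit and logs a warning when it exceeds the
// threshold, yielding messages along the lines of the "commit took"
// entries quoted above.
func timedCommit(numKeys int) error {
	start := time.Now()
	err := commitBatch(numKeys)
	if d := time.Since(start); d > slowCommitThreshold {
		log.Printf("batch [%d keys] commit took %s", numKeys, d)
	}
	return err
}

func main() {
	_ = timedCommit(1)
}
```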
As for the ambiguous result errors, I'm not currently sure whether they're expected or not. We're seeing two different types of ambiguous results -- those with "error=retry", and those with "error=unexpected value".
@spencerkimball, can you weigh in on these as a resident expert on both node liveness and ambiguous results?