
Stability problems with restore #15341

Closed · a-robinson opened this issue Apr 25, 2017 · 9 comments

@a-robinson (Contributor):

@danhhz ran a 15-node 2TB restore test on GCE today, and it was suffering from some major problems:

[Screenshots: cluster metrics from the run, showing the problems]

At the heart of these problems is that nodes are failing to heartbeat their node liveness records quickly enough, which leads to lost leases and performance problems. Judging by the logs on these nodes, there are two main causes of the slow heartbeats: very slow rocksdb commits, and ambiguous result errors.

All of the earliest "slow heartbeat" log messages on node 1 of the cluster are accompanied by a similarly slow rocksdb commit. The first error comes from a rocksdb commit that took 17.85 seconds to write a single key (batch [1/51/5] commit took 17.853569878s), and there are many more multi-second commits for very small batches in the logs (along with a number of multi-second commits of 1MB batches).
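
For reference, those warnings come from timing each engine batch commit against a threshold; here is a minimal sketch of that style of instrumentation (the function names and 500ms threshold below are illustrative only, not our actual code):

```go
// Minimal sketch of timing a batch commit and warning when it is slow.
// commitWithWarning and the threshold are illustrative, not CockroachDB code.
package main

import (
	"fmt"
	"time"
)

const slowCommitThreshold = 500 * time.Millisecond

// commitWithWarning wraps a commit function and logs when it exceeds the
// threshold, in the spirit of the "commit took ..." warnings above.
func commitWithWarning(name string, commit func() error) error {
	start := time.Now()
	err := commit()
	if d := time.Since(start); d > slowCommitThreshold {
		fmt.Printf("batch %s commit took %s\n", name, d)
	}
	return err
}

func main() {
	_ = commitWithWarning("[1/51/5]", func() error {
		time.Sleep(600 * time.Millisecond) // stand-in for a slow rocksdb write
		return nil
	})
}
```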

Our hypothesis at this point is that our disk I/O is being throttled by GCE. @danhhz is going to run another experiment to test that hypothesis. It looks plausible, though, judging by our usage in the graphs below and the documented limits at https://cloud.google.com/compute/docs/disks/performance

[Screenshots: disk I/O usage graphs]

As for the ambiguous result errors, I'm not currently sure whether they're expected. We're seeing two different types of ambiguous results, those with "error=retry" and those with "error=unexpected value":

W170425 15:51:05.079154 92 storage/node_liveness.go:289  [n1,hb] slow heartbeat took 1.0s
W170425 15:51:05.079206 92 storage/node_liveness.go:252  [n1,hb] failed node liveness heartbeat: result is ambiguous (error=retry txn "unnamed" id=871aa64c key=/System/NodeLiveness/1 rw=false pri=0.00378666 iso=SERIALIZABLE stat=PENDING epo=0 ts=1493135464.556604843,2 orig=1493135464.075142305,0 max=1493135464.576046129,0 wto=false rop=false seq=2, pending RPCs=1)
W170425 15:52:02.868877 92 storage/node_liveness.go:289  [n1,hb] slow heartbeat took 9.3s
W170425 15:52:02.868912 92 storage/node_liveness.go:252  [n1,hb] failed node liveness heartbeat: result is ambiguous (error=retry txn "unnamed" id=cb95aece key=/System/NodeLiveness/1 rw=false pri=0.00595273 iso=SERIALIZABLE stat=PENDING epo=0 ts=1493135521.257714751,2 orig=1493135513.568350121,0 max=1493135514.568350121,0 wto=false rop=false seq=2, pending RPCs=2)
W170425 15:52:04.880041 92 storage/node_liveness.go:289  [n1,hb] slow heartbeat took 2.0s
W170425 15:52:04.880070 92 storage/node_liveness.go:252  [n1,hb] failed node liveness heartbeat: result is ambiguous (error=unexpected value: raw_bytes:"\261\3113\327\003\010\001\020\003\032\014\010\370\307\340\274\377\332\253\334\024\020\000" timestamp:<wall_time:1493135522868022939 logical:0 > , pending RPCs=1)
W170425 15:52:13.339812 92 storage/node_liveness.go:289  [n1,hb] slow heartbeat took 1.8s
W170425 15:52:13.339844 92 storage/node_liveness.go:252  [n1,hb] failed node liveness heartbeat: result is ambiguous (error=unexpected value: raw_bytes:"\261\3113\327\003\010\001\020\003\032\014\010\370\307\340\274\377\332\253\334\024\020\000" timestamp:<wall_time:1493135522868022939 logical:0 > , pending RPCs=1)

@spencerkimball, can you weigh in on these as a resident expert on both node liveness and ambiguous results?

@danhhz (Contributor) commented Apr 25, 2017:

We've found very consistently in RESTORE testing that missing node liveness heartbeats can cause death spirals. There was a recent round of rate limiting added (5 in-flight ImportRequests at any given time) to try to combat this, on the assumption that it was caused by distsender/grpc contention.
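
For anyone unfamiliar with how that limit works, it's essentially a counting semaphore on the recipient side; a rough sketch of the idea (the channel-based limiter below is illustrative, not the actual implementation):

```go
// Sketch: cap in-flight imports at 5 using a buffered channel as a semaphore.
package main

import (
	"fmt"
	"sync"
)

const maxInFlightImports = 5

// importSem is a counting semaphore: at most 5 goroutines can hold a slot.
var importSem = make(chan struct{}, maxInFlightImports)

// processImport blocks until one of the 5 slots is free, then does the work.
func processImport(id int) {
	importSem <- struct{}{}        // acquire a slot (blocks when 5 are in flight)
	defer func() { <-importSem }() // release the slot when done
	fmt.Printf("processing import %d\n", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			processImport(i)
		}(i)
	}
	wg.Wait()
}
```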

An additional data point is that a 10-node cluster of the same hardware seemed to be doing fine on node liveness during RESTORE. I'm going to rerun it to see whether something in the code has changed to cause this or the rate limiting just isn't working as well on the 15-node cluster.

@danhhz (Contributor) commented Apr 25, 2017:

This indeed seems to still be happening with the 10-node cluster, but to a lesser extent. There was a very strong correlation between slow liveness heartbeats, rocksdb batches taking more than 500ms to commit, and disk I/O as reported by GCE. It really does seem we're running out of disk bandwidth.

@spencerkimball (Member):

We should get more aggressive with retries in node liveness, in particular when we notice an ambiguous result error. But some of these timeouts are just insane, and no matter what we do, we're not going to eliminate the problem of node liveness failing (and death spirals starting) if we can't service the node liveness requests within a second or two. I'll send a PR for immediate retries on ambiguous result errors.
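
The shape of the change would be roughly this (a sketch only; heartbeatOnce and the retry bound are placeholders, not the real node liveness code):

```go
// Sketch: retry the liveness heartbeat immediately on an ambiguous result
// instead of waiting for the next heartbeat interval. Names are hypothetical.
package main

import (
	"errors"
	"fmt"
)

var errAmbiguous = errors.New("result is ambiguous")

// attempts counts calls so this stub can be ambiguous once, then succeed.
var attempts int

// heartbeatOnce stands in for the conditional put on the node's liveness
// record; here it returns an ambiguous result on the first call only.
func heartbeatOnce() error {
	attempts++
	if attempts == 1 {
		return errAmbiguous
	}
	return nil
}

// heartbeatWithRetry retries immediately on an ambiguous result rather than
// waiting for the next heartbeat interval, bounded by maxRetries.
func heartbeatWithRetry(maxRetries int) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = heartbeatOnce(); !errors.Is(err, errAmbiguous) {
			return err // success, or a failure that isn't ambiguous
		}
		fmt.Printf("heartbeat attempt %d was ambiguous, retrying immediately\n", i+1)
	}
	return err
}

func main() {
	if err := heartbeatWithRetry(2); err != nil {
		fmt.Println("heartbeat failed:", err)
	} else {
		fmt.Println("heartbeat succeeded after retry")
	}
}
```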

@danhhz (Contributor) commented Apr 25, 2017:

Allowing only 1 ImportRequest to run at a time recipient-side (it was 5) seems to have helped a lot. There have been very few slow heartbeats (one node did have a brief bad period) and fewer (but still some) rocksdb batches taking > 500ms. The overall restore is running maybe 10-15% slower.

I'll let this run overnight and double back on the 15-node cluster tomorrow. Given that we still have some batches taking more than 500ms, it might also be worth rate limiting the number of concurrent WriteBatch requests (note to self: this would have to be a different rate limit than Import to avoid deadlock).
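
To make the deadlock concern concrete: if Import and WriteBatch drew from the same limiter, an Import holding a slot while waiting on its WriteBatches could wedge once every slot was held by an Import. A rough sketch with two independent limiters (names and limits below are made up):

```go
// Sketch: keep the Import and WriteBatch limits as separate semaphores so an
// Import waiting on its WriteBatches can never starve them of slots.
package main

import "fmt"

var (
	importSem     = make(chan struct{}, 1) // recipient-side Import limit
	writeBatchSem = make(chan struct{}, 2) // separate WriteBatch limit
)

func handleWriteBatch(part int) {
	writeBatchSem <- struct{}{} // acquire a WriteBatch slot
	defer func() { <-writeBatchSem }()
	fmt.Printf("applied write batch %d\n", part)
}

func handleImport(id int) {
	importSem <- struct{}{} // acquire the single Import slot
	defer func() { <-importSem }()
	// The Import issues WriteBatch work while still holding its own slot,
	// which is safe only because WriteBatch draws from a different semaphore.
	for part := 0; part < 3; part++ {
		handleWriteBatch(part)
	}
	fmt.Printf("finished import %d\n", id)
}

func main() {
	handleImport(1)
}
```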

@bdarnell (Contributor):

Yikes, a 17-second rocksdb sync is bad news. There's not much we can do if the disk is doing that to us; it's going to be pretty disruptive. Looks like we may need to prioritize #7807 in the 1.1 timeframe.

Which cluster was this? We have node_exporter stats in grafana that include disk I/O.

@petermattis (Collaborator):

FWIW, we only allow 1 snapshot to be applied at a time. Do the ImportRequests perform the download? I think the important part to limit is the WriteBatchRequests.

@a-robinson (Contributor, Author):

> Which cluster was this? We have node_exporter stats in grafana that include disk I/O.

A terraform cluster in our shared GCE project. I doubt the node exporter was running on it. We should still be able to access GCE's metrics for a while, if needed.

danhhz added a commit to danhhz/cockroach that referenced this issue on Apr 27, 2017, and again on Apr 28, 2017:

Many of our RESTORE stability problems seem to start with overloaded
disks, which cause contention in RocksDB, which slows down heartbeats,
which causes mass lease transfers. WriteBatch and the resulting write
amplification are the largest part of this, so add a rate limit directly
for that.

The WriteBatch limiting also lets us be less conservative about the
client-side Import request limiting in Restore.

For cockroachdb#15341.
@a-robinson (Contributor, Author):

@danhhz is this issue still tracking any remaining work for 1.0?

@danhhz (Contributor) commented May 1, 2017:

I was just about to close this. Nope.

@danhhz closed this as completed on May 1, 2017.