Stability problems with restore #15341
Comments
We've found very consistently in RESTORE testing that missing node liveness heartbeats can cause death spirals. There was a recent round of rate limiting added (5 in-flight ImportRequests at any given time) to try to combat this, on the assumption that it was caused by distsender/grpc contention. An additional data point is that a 10 node cluster of the same hardware seemed to be doing fine on node liveness during RESTORE. I'm going to rerun it to see if something in the code has changed to cause this or if the rate limiting isn't working as well on the 15 node cluster.
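This kind of in-flight request cap is typically implemented as a counting semaphore. A minimal Go sketch of the idea, using hypothetical names rather than the actual CockroachDB code:

```go
package main

import (
	"fmt"
	"sync"
)

// maxInFlightImports mirrors the "5 in-flight ImportRequests" limit
// mentioned above; the name and surrounding code are illustrative,
// not the actual CockroachDB implementation.
const maxInFlightImports = 5

// importSem is a buffered channel used as a counting semaphore: a send
// acquires a slot, a receive releases it.
var importSem = make(chan struct{}, maxInFlightImports)

func sendImportRequest(id int) {
	importSem <- struct{}{}        // blocks once maxInFlightImports requests are in flight
	defer func() { <-importSem }() // release the slot when done

	// ... issue the actual ImportRequest here ...
	fmt.Printf("import %d done\n", id)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sendImportRequest(i)
		}(i)
	}
	wg.Wait()
}
```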
This indeed seems to still be happening with the 10 node cluster, but to a lesser extent. There was a very strong correlation between slow liveness heartbeats, rocksdb batches taking more than 500ms to commit, and disk I/O as reported by GCE. It really does seem we're running out of disk bandwidth.
We should get more aggressive with retries in node liveness, in particular when we notice an ambiguous result error. But some of these timeouts are just insane, and no matter what we do, we're not going to eliminate the problem of node liveness failing (and death spirals starting) if we can't service the node liveness requests within a second or two. I'll send a PR for immediate retries on ambiguous result errors.
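The immediate-retry idea amounts to a loop like the following. This is an illustrative sketch under the assumption that ambiguous results are distinguishable from other heartbeat errors; `sendHeartbeat` and `errAmbiguousResult` are hypothetical stand-ins, not the real liveness code:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// errAmbiguousResult stands in for an ambiguous result error from the
// KV layer; sendHeartbeat stands in for the actual liveness heartbeat
// RPC. Both are hypothetical names for illustration.
var errAmbiguousResult = errors.New("ambiguous result")

func sendHeartbeat() error { return nil }

// heartbeatWithRetry retries immediately when the outcome of a heartbeat
// is ambiguous, rather than waiting for the next heartbeat interval and
// risking the liveness record expiring in the meantime.
func heartbeatWithRetry(maxAttempts int) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		start := time.Now()
		if err = sendHeartbeat(); err == nil {
			return nil
		}
		if !errors.Is(err, errAmbiguousResult) {
			return err // other errors fall back to the normal heartbeat loop
		}
		log.Printf("heartbeat was ambiguous after %s; retrying immediately", time.Since(start))
	}
	return err
}

func main() {
	if err := heartbeatWithRetry(3); err != nil {
		log.Printf("heartbeat failed: %v", err)
	}
}
```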
Allowing only 1 ImportRequest to run at a time recipient-side (it was 5) seems to have helped a lot. There have been very few slow heartbeats (one node did have a brief bad period) and fewer (but still some) rocksdb batches taking > 500 ms. The overall restore is running maybe 10-15% slower. I'll let this run overnight and double back on the 15 node cluster tomorrow. Given that we still have some batches taking more than 500ms, it might also be worth rate limiting the number of concurrent WriteBatch requests (note to self: this would have to be a different rate limit than Import to avoid deadlock).
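The deadlock concern comes from an Import needing to issue WriteBatches while it holds a limiter slot: if both request types shared one limiter, those WriteBatches could wait on slots the Imports never release. A hedged sketch of keeping the two limits independent (values and names are illustrative, not the shipped ones):

```go
package main

import "fmt"

// Separate counting semaphores for Import and WriteBatch. Sharing a
// single limiter would risk deadlock: an Import holding the only slot
// also issues WriteBatches, which would then wait forever for that same
// slot. Limits and names here are illustrative, not the shipped values.
var (
	importSem     = make(chan struct{}, 1) // recipient-side Import limit
	writeBatchSem = make(chan struct{}, 2) // independent WriteBatch limit
)

func processImport() {
	importSem <- struct{}{}
	defer func() { <-importSem }()

	// An Import fans out into several WriteBatches, each throttled by
	// its own semaphore rather than the Import one.
	for i := 0; i < 4; i++ {
		processWriteBatch(i)
	}
}

func processWriteBatch(i int) {
	writeBatchSem <- struct{}{}
	defer func() { <-writeBatchSem }()
	fmt.Printf("write batch %d applied\n", i)
}

func main() {
	processImport()
}
```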
Yikes, a 17-second rocksdb sync is bad news. There's not much we can do if the disk is doing that to us; it's going to be pretty disruptive. Looks like we may need to prioritize #7807 in the 1.1 timeframe. Which cluster was this? We have node_exporter stats in grafana that include disk I/O.
FWIW, we only allow 1 snapshot to be applied at a time. Do the ImportRequests perform the download? I think the important part to limit is the WriteBatchRequests.
A terraform cluster in our shared GCE project. I doubt the node exporter was running on it. We still should be able to access GCE's metrics for a while, if needed. |
Many of our RESTORE stability problems seem to be starting with overloaded disks, which caused contention in RocksDB, which slowed down heartbeats, which caused mass lease transfers. WriteBatch and the resulting write amplification is the largest part of this, so add a rate limit directly for that. The WriteBatch limiting also lets us be less conservative about the client-side Import request limiting in Restore. For cockroachdb#15341.
@danhhz is this issue still tracking any remaining work for 1.0?
Nope, I was just about to close this.
@danhhz ran a 15-node 2TB restore test on GCE today, and it was suffering from some major problems.
At the heart of these problems is that nodes are failing to heartbeat their node liveness records quickly enough, which leads to lost leases and performance problems. Judging by the logs on these nodes, there are two main causes of the slow heartbeats: very slow rocksdb commits, and ambiguous result errors.
All of the earliest "slow heartbeat" log messages on node 1 of the cluster are accompanied by a similarly slow rocksdb commit. The first error comes from a rocksdb commit that took 17.85 seconds to write a single key (`batch [1/51/5] commit took 17.853569878s`), and there are many more multi-second commits for very small batches in the logs (along with a number of multi-second commits of 1MB batches).

Our hypothesis at this point is that our disk I/O is being throttled by GCE. @danhhz is going to run another experiment to test that hypothesis. It looks feasible though, judging by our usage in the graphs below and the documented limits at https://cloud.google.com/compute/docs/disks/performance
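The "commit took" lines quoted above come from timing each batch commit and warning when it crosses a threshold. A rough sketch of that pattern, with the threshold, names, and log wording assumed rather than taken from the real code:

```go
package main

import (
	"log"
	"time"
)

// slowCommitThreshold mirrors the ~500ms threshold discussed in this
// thread; the value, names, and log wording are assumptions, not the
// exact CockroachDB code.
const slowCommitThreshold = 500 * time.Millisecond

// commitBatch stands in for the actual RocksDB batch commit.
func commitBatch(numKeys int) error {
	time.Sleep(10 * time.Millisecond) // simulate the write
	return nil
}

// timedCommit wraps a commit and logs a warning when it exceeds the
// threshold, yielding messages along the lines of the "commit took"
// entries quoted above.
func timedCommit(numKeys int) error {
	start := time.Now()
	err := commitBatch(numKeys)
	if d := time.Since(start); d > slowCommitThreshold {
		log.Printf("batch [%d keys] commit took %s", numKeys, d)
	}
	return err
}

func main() {
	_ = timedCommit(1)
}
```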
As for the ambiguous result errors, I'm not currently sure whether they're expected or not. We're seeing two different types of ambiguous results -- those with "error=retry", and those with "error=unexpected value".
@spencerkimball, can you weigh in on these as a resident expert on both node liveness and ambiguous results?