
Node takes a long time to restart after being killed during a workload. #38531

Closed
roncrdb opened this issue Jun 28, 2019 · 1 comment
Labels
C-investigation Further steps needed to qualify. C-label will change.

Comments


roncrdb commented Jun 28, 2019

Describe the problem

When running a workload on a 3-node cluster, if one node is stopped and then restarted, it takes quite a long time to come back up, to the point that I just assumed the node was dead and was not going to restart.

Log entry from when the node was killed reads:

ip-10-12-41-82> W190628 00:58:28.137760 969951 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {ip-10-12-35-255:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...

I restarted the node after it was marked as dead in the Admin UI. After 10+ minutes it was still not shown as live in the Admin UI, and the logs from the cluster read:

./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log-W190628 01:12:53.584092 378 storage/node_liveness.go:463  [n3,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log-I190628 01:12:53.799104 2636 storage/replica_raftstorage.go:814  [n3,s3,r164/3:/Table/57/1/"\xb{4#\x…-8\x1…}] applying Raft snapshot at index 8025 (id=528b3cf8, encoded size=39870286, 153 rocksdb batches, 0 log entries)
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log:I190628 01:12:54.242868 2636 storage/replica_raftstorage.go:820  [n3,s3,r164/3:/Table/57/1/"\xb{4#\x…-8\x1…}] applied Raft snapshot in 444ms [clear=1ms batch=319ms entries=0ms commit=123ms]

Later on, the logs read:

/cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log-I190628 01:22:52.113785 4651 storage/replica_raftstorage.go:820  [n3,s3,r1843/3:/Table/58/1/89{1/483…-2/455…}] applied Raft snapshot in 235ms [clear=1ms batch=151ms entries=0ms commit=82ms]
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log-I190628 01:22:53.186338 4600 storage/replica_raftstorage.go:814  [n3,s3,r861/3:/Table/58/1/{5/23092-6/27718}] applying Raft snapshot at index 1092 (id=3a3dd815, encoded size=35962848, 138 rocksdb batches, 0 log entries)
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log-I190628 01:22:53.398927 4600 storage/replica_raftstorage.go:820  [n3,s3,r861/3:/Table/58/1/{5/23092-6/27718}] applied Raft snapshot in 213ms [clear=2ms batch=146ms entries=0ms commit=63ms]
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log:I190628 01:22:54.144531 190 server/server_update.go:64  [n3] failed attempt to upgrade cluster version, error: node 3 not running (DEAD), cannot determine version
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log:I190628 01:22:54.841340 4677 storage/replica_raftstorage.go:814  [n3,s3,r57/3:/Table/60/1/6{40/8/-…-77/4/-…}] applying Raft snapshot at index 20563 (id=bc2397dc, encoded size=48447484, 185 rocksdb batches, 0 log entries)

According to the logs, the node had still not successfully restarted at this point.

However, ~12 minutes later in the logs:

./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log-I190628 01:34:52.032261 7346 storage/replica_raftstorage.go:820  [n3,s3,r1172/3:/Table/58/1/36{0/774…-1/814…}] applied Raft snapshot in 233ms [clear=2ms batch=163ms entries=0ms commit=67ms]
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log-I190628 01:34:53.081063 7437 storage/replica_raftstorage.go:814  [n3,s3,r311/3:/Table/54/1/10{1/8/1…-3/8/4…}] applying Raft snapshot at index 1283 (id=2fa342c5, encoded size=35016108, 134 rocksdb batches, 0 log entries)
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log-I190628 01:34:53.275408 7437 storage/replica_raftstorage.go:820  [n3,s3,r311/3:/Table/54/1/10{1/8/1…-3/8/4…}] applied Raft snapshot in 194ms [clear=1ms batch=140ms entries=0ms commit=52ms]
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log:I190628 01:34:54.343643 7467 storage/replica_raftstorage.go:814  [n3,s3,r1031/3:/Table/58/1/22{7/354…-8/397…}] applying Raft snapshot at index 1131 (id=9df2511a, encoded size=35934066, 137 rocksdb batches, 0 log entries)
./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log:I190628 01:34:54.590214 7467 storage/replica_raftstorage.go:820  [n3,s3,r1031/3:/Table/58/1/22{7/354…-8/397…}] applied Raft snapshot in 247ms [clear=2ms batch=183ms entries=0ms commit=60ms]

Checking the Admin UI, the node was up and running again.
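
For reference, a rough way to quantify how much of that window was spent applying Raft snapshots is to count and bound the relevant log lines. A minimal shell sketch, using the log file name from the excerpts above:

LOG=./cockroach.ip-10-12-35-255.ubuntu.2019-06-28T01_02_08Z.006376.log
# Total number of Raft snapshots applied while the node was catching up.
grep -c "applied Raft snapshot" "$LOG"
# First and last applications, to bound the catch-up window.
grep "applied Raft snapshot" "$LOG" | head -n 1
grep "applied Raft snapshot" "$LOG" | tail -n 1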

Steps to Reproduce
Cockroach Version 19.1.1
Using AWS instance type c5d.4xlarge
Import TPCC 1K
Run TPCC 1K against just the first two nodes:

roachprod run $CLUSTER:4 "./workload run tpcc --ramp=5m --warehouses=1000 --active-warehouses=1000 --duration=10m --scatter {pgurl:1-2}"

Kill the third node:

roachprod stop $CLUSTER:3
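
Then restart the node after it has been marked dead. The exact command isn't shown above; a sketch, assuming the default roachprod start invocation:

# Hypothetical restart step; adjust start flags to match your cluster setup.
roachprod start $CLUSTER:3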

I had made a few cluster setting changes as per the issue that first reported this; they are as follows:

SET CLUSTER SETTING server.time_until_store_dead='3m';
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='32MiB';
SET CLUSTER SETTING kv.snapshot_recovery.max_rate='32MiB';
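
To confirm these took effect, the values can be read back from any node. A minimal sketch, assuming an insecure cluster with the cockroach binary available on the host:

# Assumes an insecure cluster; adjust connection flags as needed.
cockroach sql --insecure -e "SHOW CLUSTER SETTING server.time_until_store_dead;"
cockroach sql --insecure -e "SHOW CLUSTER SETTING kv.snapshot_rebalance.max_rate;"
cockroach sql --insecure -e "SHOW CLUSTER SETTING kv.snapshot_recovery.max_rate;"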

Here is the log file from node 3.

roncrdb added the C-investigation label Jun 28, 2019
tbg (Member) commented Jun 28, 2019

Reasonably sure this is just another rediscovery of #37906. We're trying to get a mitigation into 19.1.3; so far it looks like we'll succeed. The main PR is #38484 and will hopefully go a long way already.

Note that on a 3-node cluster the <5 vs. >5 minute distinction doesn't matter because there's nowhere else for the replicas to go, so they stay on the dead node indefinitely.

tbg closed this as completed Jun 28, 2019