When running a workload on a 3-node cluster, if one node is stopped and then restarted, the restart takes a very long time, to the point that I assumed the node was dead and was not going to come back.
Log entry from when the node was killed reads:
ip-10-12-41-82> W190628 00:58:28.137760 969951 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {ip-10-12-35-255:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I restarted the node after it was marked as dead in the Admin UI. After 10+ minutes it was still not shown as live in the Admin UI, and the logs from the cluster read:
And checking in the Admin UI the node is up and running again.
Steps to Reproduce
Cockroach Version 19.1.1
Using AWS instance type c5d.4xlarge
Import TPCC 1K
Run TPCC 1k on just the first two nodes:
roachprod run $CLUSTER:4 "./workload run tpcc --ramp=5m --warehouses=1000 --active-warehouses=1000 --duration=10m --scatter {pgurl:1-2}"
Kill the third node:
`roachprod stop $CLUSTER:3`
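The steps above can be strung together roughly as follows. This is only a sketch, not the script actually used: the `workload run` and `roachprod stop` commands are taken from the steps above, while the `fixtures import` invocation, the restart via `roachprod start`, the default cluster name, and the `run` helper with its `DRY_RUN` switch are assumptions added for illustration. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. Commands are printed, not executed,
# unless DRY_RUN=0 is set.
set -eu

CLUSTER="${CLUSTER:-my-cluster}"   # hypothetical cluster name
DRY_RUN="${DRY_RUN:-1}"

# run prints the command in dry-run mode, otherwise executes it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

repro() {
  # Assumed import step; the issue only says "Import TPCC 1K".
  run roachprod run "$CLUSTER:4" "./workload fixtures import tpcc --warehouses=1000 {pgurl:1}"
  # Workload against nodes 1-2 only (from the issue). This step blocks for
  # the workload duration, so in practice the stop below would be issued
  # from another terminal while it runs.
  run roachprod run "$CLUSTER:4" "./workload run tpcc --ramp=5m --warehouses=1000 --active-warehouses=1000 --duration=10m --scatter {pgurl:1-2}"
  # Kill the third node (from the issue).
  run roachprod stop "$CLUSTER:3"
  # Assumed restart command, issued once the Admin UI marks the node as dead.
  run roachprod start "$CLUSTER:3"
}

repro
```

Running it with `DRY_RUN=0` would execute the commands instead of printing them; the timing between the stop and the restart is not specified above, so it is left to the operator.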
I had made a few cluster setting changes, as per the issue that first reported this. They are as follows:
SET CLUSTER SETTING server.time_until_store_dead='3m';
SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='32MiB';
SET CLUSTER SETTING kv.snapshot_recovery.max_rate='32MiB';
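To confirm that these settings took effect before reproducing, something like the following could be used. It is a minimal sketch: the helper name `show_settings_sql` is hypothetical, and it only generates `SHOW CLUSTER SETTING` statements for the three settings listed above.

```shell
# Print one SHOW CLUSTER SETTING statement per setting changed above.
# show_settings_sql is a hypothetical helper name used for illustration.
show_settings_sql() {
  for s in \
    server.time_until_store_dead \
    kv.snapshot_rebalance.max_rate \
    kv.snapshot_recovery.max_rate
  do
    printf 'SHOW CLUSTER SETTING %s;\n' "$s"
  done
}

show_settings_sql
```

The output can be piped into a SQL shell on any live node, for example `show_settings_sql | cockroach sql --insecure --host=<node>`, assuming an insecure test cluster.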
Reasonably sure this is just another rediscovery of #37906. We're trying to get a mitigation into 19.1.3, and so far it looks like we'll succeed. The main PR is #38484, which will hopefully go a long way already.
Note that on a 3 node cluster the <5 vs >5 minute distinction doesn't matter because there's nowhere else for the replicas to go, so they stay on the dead node indefinitely.
Describe the problem
When running a workload on a 3-node cluster, if one node is stopped and then restarted, the restart takes a very long time, to the point that I assumed the node was dead and was not going to come back.
Log entry from when the node was killed reads:
I restarted the node after it was marked as dead in the Admin UI. After 10+ minutes it was still not shown as live in the Admin UI, and the logs from the cluster read:
Later on the logs read:
The node never successfully restarted according to the logs at this point.
However ~12 minutes later on in the logs:
And checking in the Admin UI the node is up and running again.
Here is the log file from node 3.