
Cluster stops working after a couple of minutes #53609

Closed
luiscosio opened this issue Aug 28, 2020 · 13 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-community: Originated from the community
X-blathers-triaged: blathers was able to find an owner

Comments

@luiscosio

Describe the problem

After the cluster has been running for a couple of minutes, all three nodes go to "suspect" status and then the cluster stops working. Here is the error:

W200828 20:17:47.799579 1649 kv/kvserver/node_liveness.go:488  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
(1) operation "node liveness heartbeat" timed out after 4.5s
Wraps: (2) context deadline exceeded
Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W200828 20:17:49.954217 595411 kv/kvclient/kvcoord/dist_sender.go:1493  [n1,replicaGC,s1,r47699/1:/Table/68/3/"8pm98qp8tpxfs6"/…] have been waiting 60.00s (1 attempts) for RPC to r1:/{Min-System} [(n1,s1):1, (n2,s2):2, (n5,s5):5, next=6, gen=1?]: aborted during DistSender.Send: context deadline exceeded
W200828 20:17:49.954271 595411 kv/kvclient/kvcoord/dist_sender.go:1496  [n1,replicaGC,s1,r47699/1:/Table/68/3/"8pm98qp8tpxfs6"/…] slow RPC finished after 60.00s (1 attempts)
E200828 20:17:49.954339 595411 kv/kvserver/queue.go:1090  [n1,replicaGC,s1,r47699/1:/Table/68/3/"8pm98qp8tpxfs6"/…] operation "replicaGC queue process replica 47699" timed out after 1m0s: aborted during DistSender.Send: context deadline exceeded
W200828 20:17:51.053006 596174 kv/kvserver/replica_range_lease.go:1034  [n1,consistencyChecker,s1,r48718/1:/Table/68/1/"z49vqv682sqqpd/m…] have been waiting 1m0s attempting to acquire lease
I200828 20:17:51.053086 596174 kv/kvserver/replica_range_lease.go:1039  [n1,consistencyChecker,s1,r48718/1:/Table/68/1/"z49vqv682sqqpd/m…] slow lease acquisition finished after 1m0.000507997s with error [NotLeaseHolderError] r48718: replica (n1,s1):1 not lease holder; lease holder unknown after 1 attempts
W200828 20:17:51.198190 592273 kv/kvserver/replica_range_lease.go:1034  [n1,replicate,s1,r38472/1:/Table/68/1/"z49vqv682sqqpd/m…] have been waiting 1m0s attempting to acquire lease
I200828 20:17:51.198230 592273 kv/kvserver/replica_range_lease.go:1039  [n1,replicate,s1,r38472/1:/Table/68/1/"z49vqv682sqqpd/m…] slow lease acquisition finished after 1m0.000185174s with error [NotLeaseHolderError] r38472: replica (n1,s1):1 not lease holder; lease holder unknown after 1 attempts
E200828 20:17:51.949446 622678 server/status.go:1041  [n1,status] aborted during DistSender.Send: context deadline exceeded
W200828 20:17:52.299845 1649 kv/kvserver/node_liveness.go:563  [n1,liveness-hb] slow heartbeat took 4.5s
W200828 20:17:52.299905 1649 kv/kvserver/node_liveness.go:488  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
(1) operation "node liveness heartbeat" timed out after 4.5s
Wraps: (2) context deadline exceeded
Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W200828 20:17:52.646372 597670 kv/kvserver/replica_range_lease.go:1034  [n1,gc,s1,r53548/1:/Table/69/1/"mid:fe5{00…-93…}] have been waiting 1m0s attempting to acquire lease
I200828 20:17:52.646467 597670 kv/kvserver/replica_range_lease.go:1039  [n1,gc,s1,r53548/1:/Table/69/1/"mid:fe5{00…-93…}] slow lease acquisition finished after 1m0.001302599s with error [NotLeaseHolderError] r53548: replica (n1,s1):1 not lease holder; lease holder unknown after 1 attempts
W200828 20:17:52.722833 193 kv/kvserver/closedts/provider/provider.go:152  [ct-closer] unable to move closed timestamp forward: not live
github.com/cockroachdb/cockroach/pkg/kv/kvserver.init
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/node_liveness.go:57
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5222
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.main
        /usr/local/go/src/runtime/proc.go:190
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1357
W200828 20:17:53.188253 598204 kv/kvclient/kvcoord/dist_sender.go:1493  [n1,raftlog,s1,r5793/1:/Table/68/4/"hid:400{3e…-c1…}] have been waiting 60.00s (1 attempts) for RPC to r5793:/Table/68/4/"hid:400{3e53142daa3675165d4fd1551b422694508a4f86e8af2e1cedadc7c9dd5378f5f2f65d2e69dcc8df23501df65b456"-c1b12281fb4dddbc572c6283052f7da6d58fd3f5e27d45ea8051d32b30c70eaef97fead056d7d390df80472f75117"} [(n1,s1):1, (n2,s2):2, (n5,s5):5, next=6, gen=3?]: aborted during DistSender.Send: context deadline exceeded
W200828 20:17:53.188295 598204 kv/kvclient/kvcoord/dist_sender.go:1496  [n1,raftlog,s1,r5793/1:/Table/68/4/"hid:400{3e…-c1…}] slow RPC finished after 60.00s (1 attempts)
E200828 20:17:53.188379 598204 kv/kvserver/queue.go:1090  [n1,raftlog,s1,r5793/1:/Table/68/4/"hid:400{3e…-c1…}] operation "raftlog queue process replica 5793" timed out after 1m0s: aborted during DistSender.Send: context deadline exceeded
W200828 20:17:54.360801 599113 kv/kvserver/replica_range_lease.go:1034  [n1,merge,s1,r41966/1:/Table/69/1/"mid:ab67b6c…] have been waiting 1m0s attempting to acquire lease
I200828 20:17:54.360850 599113 kv/kvserver/replica_range_lease.go:1039  [n1,merge,s1,r41966/1:/Table/69/1/"mid:ab67b6c…] slow lease acquisition finished after 1m0.000165923s with error [NotLeaseHolderError] r41966: replica (n1,s1):1 not lease holder; lease holder unknown after 1 attempts
W200828 20:17:54.360933 599114 kv/kvserver/node_liveness.go:563  [n1,s1,r41966/1:/Table/69/1/"mid:ab67b6c…] slow heartbeat took 60.0s
E200828 20:17:54.360979 599114 kv/kvserver/replica_range_lease.go:339  [n1,s1,r41966/1:/Table/69/1/"mid:ab67b6c…] context canceled
I200828 20:17:56.291683 1705 server/status/runtime.go:498  [n1] runtime stats: 2.5 GiB RSS, 1071 goroutines, 1.1 GiB/228 MiB/1.6 GiB GO alloc/idle/total, 344 MiB/652 MiB CGO alloc/total, 7022.6 CGO/sec, 256.9/32.5 %(u/s)time, 0.0 %gc (1x), 805 KiB/804 KiB (r/w)net
W200828 20:17:56.800105 1649 kv/kvserver/node_liveness.go:563  [n1,liveness-hb] slow heartbeat took 4.5s
W200828 20:17:56.800188 1649 kv/kvserver/node_liveness.go:488  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
(1) operation "node liveness heartbeat" timed out after 4.5s
Wraps: (2) context deadline exceeded
Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
E200828 20:17:57.380572 601265 kv/kvserver/queue.go:1090  [n1,raftlog,s1,r46529/1:/Table/69/2/"mid:c8{33d…-76d…}] operation "raftlog queue process replica 46529" timed out after 1m0s: aborted during DistSender.Send: context deadline exceeded
W200828 20:18:01.300830 1649 kv/kvserver/node_liveness.go:563  [n1,liveness-hb] slow heartbeat took 4.5s
W200828 20:18:01.300914 1649 kv/kvserver/node_liveness.go:488  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
(1) operation "node liveness heartbeat" timed out after 4.5s
Wraps: (2) context deadline exceeded
Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError

To Reproduce

Running on three different nodes:

docker run -d -p 26257:26257 -p 8085:8080 -v /var/server/cockroach-data:/cockroach/cockroach-data -v /var/server/certs:/etc/ssl/private --name cockroach cockroachdb/cockroach:v20.1.4 start --insecure --advertise-addr=10.67.42.158 --join=10.67.42.157,10.67.42.158,10.67.42.159 --max-sql-memory=1024MiB
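
(The issue only shows one node's command; presumably the same command runs on each host with that node's own IP in --advertise-addr. A minimal sketch for one of the other hosts, under that assumption, with everything else unchanged:)

# Assumption: identical command on each host, only --advertise-addr changes
docker run -d -p 26257:26257 -p 8085:8080 \
  -v /var/server/cockroach-data:/cockroach/cockroach-data \
  -v /var/server/certs:/etc/ssl/private \
  --name cockroach cockroachdb/cockroach:v20.1.4 start --insecure \
  --advertise-addr=10.67.42.157 \
  --join=10.67.42.157,10.67.42.158,10.67.42.159 --max-sql-memory=1024MiB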

Database size is 1.5TB

Environment:

  • CockroachDB version 20.1.4
  • Server OS: Red Hat Enterprise Linux
  • Client app [e.g. cockroach sql, JDBC, ...]
@blathers-crl

blathers-crl bot commented Aug 28, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl bot added the C-bug, O-community, and X-blathers-triaged labels on Aug 28, 2020
@luiscosio
Author

As additional information, this is what happens after the cluster has been running for a couple of minutes:

image

@luiscosio
Author

Pretty similar to this issue, where the cluster stays online for just half a minute:

#45929 (comment)

@irfansharif
Contributor

Hi @luiscosio, mind sharing details about the machines these CRDB nodes are running on? Given there's 1.5 TB of data being stored, I'm just checking that these boxes aren't overloaded.

@luiscosio
Author

Hello @irfansharif, the servers have 64 vCPUs and 64 GB of RAM.

@irfansharif
Contributor

@nvanbenschoten, @aayushshah15: Is this the sort of thing that #51888 and #51894 would also help mitigate? What was CRDB's behavior on cold restarts when there are as many ranges as shown above? Is that now improved?

@luiscosio
Author

That is only available in v20.2.0-alpha.3, right @irfansharif? What would you recommend for starting the cluster again?

I see a couple of configuration suggestions over in #46660 (comment), but I can't find anywhere in the docs where I can configure max_range_bytes=512M, min_range_bytes=128M.

@irfansharif
Contributor

That is only available at v20.2.0-alpha.3, right @irfansharif?

Yes, we're hoping to include those patches in our upcoming 20.2 release, but if you're open to trying out the alpha in a test setup, I'd be curious to know what happens.

As for setting the range size, in 20.1 we should already default to 512 MB. The link you were looking for is https://www.cockroachlabs.com/docs/stable/configure-zone.html#variables.
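
(A minimal sketch of how the range-size variables from the linked docs can be set through a zone configuration, assuming the 128 MiB / 512 MiB values discussed above; verify the statement against the docs for your version:)

# Hypothetical example: set range_min_bytes/range_max_bytes on the default zone
# (134217728 bytes = 128 MiB, 536870912 bytes = 512 MiB)
cockroach sql --insecure --host=10.67.42.158 --execute \
  "ALTER RANGE default CONFIGURE ZONE USING range_min_bytes = 134217728, range_max_bytes = 536870912;"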

@luiscosio
Author

We'll try the alpha release and let you know if this fixes the issue.

Regarding setting COCKROACH_SCAN_INTERVAL=60m, and COCKROACH_SCAN_MIN_IDLE_TIME=1s, in a dockerized cockroach environment, how would you set up those variables?

@irfansharif
Contributor

I believe those are expected to be set as environment variables when running the cockroach binary.
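
(A minimal sketch of how that could look with the docker run command from the issue description, assuming the variables only need to be visible to the cockroach process inside the container:)

# Pass the environment variables into the container with docker's -e flag
# so the cockroach binary picks them up at startup.
docker run -d -p 26257:26257 -p 8085:8080 \
  -v /var/server/cockroach-data:/cockroach/cockroach-data \
  -e COCKROACH_SCAN_INTERVAL=60m \
  -e COCKROACH_SCAN_MIN_IDLE_TIME=1s \
  --name cockroach cockroachdb/cockroach:v20.1.4 start --insecure \
  --advertise-addr=10.67.42.158 \
  --join=10.67.42.157,10.67.42.158,10.67.42.159 --max-sql-memory=1024MiB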

@luiscosio
Author

luiscosio commented Aug 31, 2020 via email

@luiscosio
Author

@irfansharif The cluster is working just fine, but since we updated it per this ticket so that we could bring it back online from a cold start, we are now experiencing this issue:

#54842

@jordanlewis
Member

Closing due to age.
