
Cluster stops working after a couple of minutes #53609

Closed
luiscosio opened this issue Aug 28, 2020 · 13 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-community: Originated from the community
X-blathers-triaged: blathers was able to find an owner

Comments

@luiscosio

Describe the problem

After the cluster has been running for a couple of minutes, all three nodes go to "suspect" status and then the cluster stops working. Here is the error:

W200828 20:17:47.799579 1649 kv/kvserver/node_liveness.go:488  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
(1) operation "node liveness heartbeat" timed out after 4.5s
Wraps: (2) context deadline exceeded
Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W200828 20:17:49.954217 595411 kv/kvclient/kvcoord/dist_sender.go:1493  [n1,replicaGC,s1,r47699/1:/Table/68/3/"8pm98qp8tpxfs6"/…] have been waiting 60.00s (1 attempts) for RPC to r1:/{Min-System} [(n1,s1):1, (n2,s2):2, (n5,s5):5, next=6, gen=1?]: aborted during DistSender.Send: context deadline exceeded
W200828 20:17:49.954271 595411 kv/kvclient/kvcoord/dist_sender.go:1496  [n1,replicaGC,s1,r47699/1:/Table/68/3/"8pm98qp8tpxfs6"/…] slow RPC finished after 60.00s (1 attempts)
E200828 20:17:49.954339 595411 kv/kvserver/queue.go:1090  [n1,replicaGC,s1,r47699/1:/Table/68/3/"8pm98qp8tpxfs6"/…] operation "replicaGC queue process replica 47699" timed out after 1m0s: aborted during DistSender.Send: context deadline exceeded
W200828 20:17:51.053006 596174 kv/kvserver/replica_range_lease.go:1034  [n1,consistencyChecker,s1,r48718/1:/Table/68/1/"z49vqv682sqqpd/m…] have been waiting 1m0s attempting to acquire lease
I200828 20:17:51.053086 596174 kv/kvserver/replica_range_lease.go:1039  [n1,consistencyChecker,s1,r48718/1:/Table/68/1/"z49vqv682sqqpd/m…] slow lease acquisition finished after 1m0.000507997s with error [NotLeaseHolderError] r48718: replica (n1,s1):1 not lease holder; lease holder unknown after 1 attempts
W200828 20:17:51.198190 592273 kv/kvserver/replica_range_lease.go:1034  [n1,replicate,s1,r38472/1:/Table/68/1/"z49vqv682sqqpd/m…] have been waiting 1m0s attempting to acquire lease
I200828 20:17:51.198230 592273 kv/kvserver/replica_range_lease.go:1039  [n1,replicate,s1,r38472/1:/Table/68/1/"z49vqv682sqqpd/m…] slow lease acquisition finished after 1m0.000185174s with error [NotLeaseHolderError] r38472: replica (n1,s1):1 not lease holder; lease holder unknown after 1 attempts
E200828 20:17:51.949446 622678 server/status.go:1041  [n1,status] aborted during DistSender.Send: context deadline exceeded
W200828 20:17:52.299845 1649 kv/kvserver/node_liveness.go:563  [n1,liveness-hb] slow heartbeat took 4.5s
W200828 20:17:52.299905 1649 kv/kvserver/node_liveness.go:488  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
(1) operation "node liveness heartbeat" timed out after 4.5s
Wraps: (2) context deadline exceeded
Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
W200828 20:17:52.646372 597670 kv/kvserver/replica_range_lease.go:1034  [n1,gc,s1,r53548/1:/Table/69/1/"mid:fe5{00…-93…}] have been waiting 1m0s attempting to acquire lease
I200828 20:17:52.646467 597670 kv/kvserver/replica_range_lease.go:1039  [n1,gc,s1,r53548/1:/Table/69/1/"mid:fe5{00…-93…}] slow lease acquisition finished after 1m0.001302599s with error [NotLeaseHolderError] r53548: replica (n1,s1):1 not lease holder; lease holder unknown after 1 attempts
W200828 20:17:52.722833 193 kv/kvserver/closedts/provider/provider.go:152  [ct-closer] unable to move closed timestamp forward: not live
github.com/cockroachdb/cockroach/pkg/kv/kvserver.init
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/node_liveness.go:57
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5222
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.main
        /usr/local/go/src/runtime/proc.go:190
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1357
W200828 20:17:53.188253 598204 kv/kvclient/kvcoord/dist_sender.go:1493  [n1,raftlog,s1,r5793/1:/Table/68/4/"hid:400{3e…-c1…}] have been waiting 60.00s (1 attempts) for RPC to r5793:/Table/68/4/"hid:400{3e53142daa3675165d4fd1551b422694508a4f86e8af2e1cedadc7c9dd5378f5f2f65d2e69dcc8df23501df65b456"-c1b12281fb4dddbc572c6283052f7da6d58fd3f5e27d45ea8051d32b30c70eaef97fead056d7d390df80472f75117"} [(n1,s1):1, (n2,s2):2, (n5,s5):5, next=6, gen=3?]: aborted during DistSender.Send: context deadline exceeded
W200828 20:17:53.188295 598204 kv/kvclient/kvcoord/dist_sender.go:1496  [n1,raftlog,s1,r5793/1:/Table/68/4/"hid:400{3e…-c1…}] slow RPC finished after 60.00s (1 attempts)
E200828 20:17:53.188379 598204 kv/kvserver/queue.go:1090  [n1,raftlog,s1,r5793/1:/Table/68/4/"hid:400{3e…-c1…}] operation "raftlog queue process replica 5793" timed out after 1m0s: aborted during DistSender.Send: context deadline exceeded
W200828 20:17:54.360801 599113 kv/kvserver/replica_range_lease.go:1034  [n1,merge,s1,r41966/1:/Table/69/1/"mid:ab67b6c…] have been waiting 1m0s attempting to acquire lease
I200828 20:17:54.360850 599113 kv/kvserver/replica_range_lease.go:1039  [n1,merge,s1,r41966/1:/Table/69/1/"mid:ab67b6c…] slow lease acquisition finished after 1m0.000165923s with error [NotLeaseHolderError] r41966: replica (n1,s1):1 not lease holder; lease holder unknown after 1 attempts
W200828 20:17:54.360933 599114 kv/kvserver/node_liveness.go:563  [n1,s1,r41966/1:/Table/69/1/"mid:ab67b6c…] slow heartbeat took 60.0s
E200828 20:17:54.360979 599114 kv/kvserver/replica_range_lease.go:339  [n1,s1,r41966/1:/Table/69/1/"mid:ab67b6c…] context canceled
I200828 20:17:56.291683 1705 server/status/runtime.go:498  [n1] runtime stats: 2.5 GiB RSS, 1071 goroutines, 1.1 GiB/228 MiB/1.6 GiB GO alloc/idle/total, 344 MiB/652 MiB CGO alloc/total, 7022.6 CGO/sec, 256.9/32.5 %(u/s)time, 0.0 %gc (1x), 805 KiB/804 KiB (r/w)net
W200828 20:17:56.800105 1649 kv/kvserver/node_liveness.go:563  [n1,liveness-hb] slow heartbeat took 4.5s
W200828 20:17:56.800188 1649 kv/kvserver/node_liveness.go:488  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
(1) operation "node liveness heartbeat" timed out after 4.5s
Wraps: (2) context deadline exceeded
Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError
E200828 20:17:57.380572 601265 kv/kvserver/queue.go:1090  [n1,raftlog,s1,r46529/1:/Table/69/2/"mid:c8{33d…-76d…}] operation "raftlog queue process replica 46529" timed out after 1m0s: aborted during DistSender.Send: context deadline exceeded
W200828 20:18:01.300830 1649 kv/kvserver/node_liveness.go:563  [n1,liveness-hb] slow heartbeat took 4.5s
W200828 20:18:01.300914 1649 kv/kvserver/node_liveness.go:488  [n1,liveness-hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
(1) operation "node liveness heartbeat" timed out after 4.5s
Wraps: (2) context deadline exceeded
Error types: (1) *contextutil.TimeoutError (2) context.deadlineExceededError

To Reproduce

Running on three different nodes:

docker run -d -p 26257:26257 -p 8085:8080 -v /var/server/cockroach-data:/cockroach/cockroach-data -v /var/server/certs:/etc/ssl/private --name cockroach cockroachdb/cockroach:v20.1.4 start --insecure --advertise-addr=10.67.42.158 --join=10.67.42.157,10.67.42.158,10.67.42.159 --max-sql-memory=1024MiB
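
(The issue only shows one node's command; presumably the same command runs on each host with that node's own IP in --advertise-addr. A minimal sketch for one of the other hosts, under that assumption, with everything else unchanged:)

# Assumption: identical command on each host, only --advertise-addr changes
docker run -d -p 26257:26257 -p 8085:8080 \
  -v /var/server/cockroach-data:/cockroach/cockroach-data \
  -v /var/server/certs:/etc/ssl/private \
  --name cockroach cockroachdb/cockroach:v20.1.4 start --insecure \
  --advertise-addr=10.67.42.157 \
  --join=10.67.42.157,10.67.42.158,10.67.42.159 --max-sql-memory=1024MiB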

Database size is 1.5TB

Environment:

  • CockroachDB version 20.1.4
  • Server OS: Red Hat Enterprise Linux
  • Client app [e.g. cockroach sql, JDBC, ...]
@blathers-crl

blathers-crl bot commented Aug 28, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

blathers-crl bot added the C-bug, O-community, and X-blathers-triaged labels on Aug 28, 2020
@luiscosio
Author

As additional information, this is what happens after the cluster has been running for a couple of minutes:

image

@luiscosio
Author

Pretty similar to this issue, where the cluster stays online for just half a minute:

#45929 (comment)

@irfansharif
Contributor

Hi @luiscosio, mind sharing details about the machines these CRDB nodes are running on? Given there's 1.5 TB of data being stored, I'm just checking that these boxes aren't overloaded.

@luiscosio
Author

Hello @irfansharif, the servers have 64 vCPUs and 64 GB of RAM.

@irfansharif
Contributor

@nvanbenschoten, @aayushshah15: Is this the sort of thing that #51888 and #51894 would also help mitigate? What was CRDB's behavior on cold restarts when there are as many ranges as shown above? Is that now improved?

@luiscosio
Author

That is only available in v20.2.0-alpha.3, right @irfansharif? What would you recommend for starting the cluster again?

I see a couple of configuration suggestions over in #46660 (comment), but I can't find anywhere in the docs where I can configure max_range_bytes=512M, min_range_bytes=128M.

@irfansharif
Contributor

That is only available at v20.2.0-alpha.3, right @irfansharif?

Yes, we're hoping to include those patches in our upcoming 20.2 release, but if you're open to trying out the alpha in a test setup, I'd be curious to know what happens.

As for setting the range size, in 20.1 we should already default to 512 MB. The link you were looking for is https://www.cockroachlabs.com/docs/stable/configure-zone.html#variables.
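
(A minimal sketch of how the range-size variables from the linked docs can be set through a zone configuration, assuming the 128 MiB / 512 MiB values discussed above; verify the statement against the docs for your version:)

# Hypothetical example: set range_min_bytes/range_max_bytes on the default zone
# (134217728 bytes = 128 MiB, 536870912 bytes = 512 MiB)
cockroach sql --insecure --host=10.67.42.158 --execute \
  "ALTER RANGE default CONFIGURE ZONE USING range_min_bytes = 134217728, range_max_bytes = 536870912;"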

@luiscosio
Author

We'll try the alpha release and let you know if this fixes the issue.

Regarding setting COCKROACH_SCAN_INTERVAL=60m, and COCKROACH_SCAN_MIN_IDLE_TIME=1s, in a dockerized cockroach environment, how would you set up those variables?

@irfansharif
Contributor

I believe those are expected to be set as environment variables when running the cockroach binary.
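
(A minimal sketch of how that could look with the docker run command from the issue description, assuming the variables only need to be visible to the cockroach process inside the container:)

# Pass the environment variables into the container with docker's -e flag
# so the cockroach binary picks them up at startup.
docker run -d -p 26257:26257 -p 8085:8080 \
  -v /var/server/cockroach-data:/cockroach/cockroach-data \
  -e COCKROACH_SCAN_INTERVAL=60m \
  -e COCKROACH_SCAN_MIN_IDLE_TIME=1s \
  --name cockroach cockroachdb/cockroach:v20.1.4 start --insecure \
  --advertise-addr=10.67.42.158 \
  --join=10.67.42.157,10.67.42.158,10.67.42.159 --max-sql-memory=1024MiB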

@luiscosio
Author

luiscosio commented Aug 31, 2020 via email

@luiscosio
Author

@irfansharif The cluster is working just fine, but since we updated it per this ticket so that we could bring it back online from a cold start, we are now experiencing this issue:

#54842

@jordanlewis
Member

Closing due to age.
