5-node cluster unavailable for an hour after one server reboots #20114

Closed
gpaul opened this issue Nov 17, 2017 · 63 comments
Labels: C-question (A question rather than an issue. No code/spec/doc change needed.), O-community (Originated from the community)

@gpaul

gpaul commented Nov 17, 2017

Is this a question, feature request, or bug report?

BUG REPORT

  1. Please supply the header (i.e. the first few lines) of your most recent
    log file for each node in your cluster. On most unix-based systems
    running with defaults, this boils down to the output of

    grep -F '[config]' cockroach-data/logs/cockroach.log

    When log files are not available, supply the output of cockroach version
    and all flags/environment variables passed to cockroach start instead.

There are 5 nodes in the cluster; all 5 share the following command line and version:

2017-11-09 14:14:33: + exec /opt/mesosphere/active/cockroach/bin/cockroach start --logtostderr --cache=100MiB --store=/var/lib/dcos/cockroach --certs-dir=/run/dcos/pki/cockroach --advertise-host=17.121.4.134 --host=17.121.4.134 --port=26257 --http-host=127.0.0.1 --http-port=8090 --log-dir= --join=17.121.4.196,17.121.4.134,17.121.4.3,17.121.4.66,17.121.4.195
2017-11-09 14:14:33: I171109 14:14:33.936738 1 cli/start.go:593  CockroachDB CCL v1.0.6 (linux amd64, built 2017/10/25 17:35:28, go1.8.3)
2017-11-09 14:14:34: I171109 14:14:34.037306 1 server/config.go:281  available memory from cgroups (8.0 EiB) exceeds system memory 188 GiB, using system memory
2017-11-09 14:14:34: I171109 14:14:34.037346 1 server/config.go:375  system total memory: 188 GiB
2017-11-09 14:14:34: I171109 14:14:34.037471 1 server/config.go:377  server configuration:
2017-11-09 14:14:34: max offset                   500ms
2017-11-09 14:14:34: cache size                   100 MiB
2017-11-09 14:14:34: SQL memory pool size         47 GiB
2017-11-09 14:14:34: scan interval                10m0s
2017-11-09 14:14:34: scan max idle time           200ms
2017-11-09 14:14:34: consistency check interval   24h0m0s
2017-11-09 14:14:34: metrics sample interval      10s
2017-11-09 14:14:34: time until store dead        5m0s
2017-11-09 14:14:34: send next timeout            10m0s
2017-11-09 14:14:34: event log enabled            true
2017-11-09 14:14:34: linearizable                 false
2017-11-09 14:14:34: I171109 14:14:34.037627 12 cli/start.go:340  starting cockroach node
2017-11-09 14:14:34: I171109 14:14:34.037650 12 cli/start.go:342  using local environment variables: COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=true
2017-11-09 14:14:34: W171109 14:14:34.037793 12 security/certificate_loader.go:240  error finding key for /run/dcos/pki/cockroach/client.root.crt: could not read key file /run/dcos/pki/cockroach/client.root.key: open /run/dcos/pki/cockroach/client.root.key: permission denied
2017-11-09 14:14:34: W171109 14:14:34.043623 12 gossip/gossip.go:1196  [n?] no incoming or outgoing connections
2017-11-09 14:14:34: I171109 14:14:34.043842 12 storage/engine/rocksdb.go:380  opening rocksdb instance at "/var/lib/dcos/cockroach"
2017-11-09 14:14:34: I171109 14:14:34.070567 98 gossip/client.go:131  [n?] started gossip client to 17.121.4.196:26257
2017-11-09 14:14:34: I171109 14:14:34.412713 12 server/config.go:475  [n?] 1 storage engine initialized
2017-11-09 14:14:34: I171109 14:14:34.412757 12 server/config.go:477  [n?] RocksDB cache size: 100 MiB
2017-11-09 14:14:34: I171109 14:14:34.412766 12 server/config.go:477  [n?] store 0: RocksDB, max size 0 B, max open file limit -1
2017-11-09 14:14:34: I171109 14:14:34.430757 12 server/server.go:672  [n?] sleeping for 112.69279ms to guarantee HLC monotonicity
2017-11-09 14:14:34: I171109 14:14:34.686302 12 server/node.go:467  [n2] initialized store [n2,s2]: {Capacity:997251764224 Available:992909885440 RangeCount:77 LeaseCount:0}
2017-11-09 14:14:34: I171109 14:14:34.686355 12 server/node.go:351  [n2] node ID 2 initialized
2017-11-09 14:14:34: I171109 14:14:34.686465 12 gossip/gossip.go:297  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"17.121.4.134:26257" > attrs:<> locality:<>
2017-11-09 14:14:34: I171109 14:14:34.686589 12 storage/stores.go:296  [n2] read 4 node addresses from persistent storage
2017-11-09 14:14:34: I171109 14:14:34.686764 12 server/node.go:608  [n2] connecting to gossip network to verify cluster ID...
2017-11-09 14:14:34: I171109 14:14:34.686790 12 server/node.go:633  [n2] node connected via gossip and verified as part of cluster "8c1152d0-511e-4419-9ac2-ff1e15f5a74e"
2017-11-09 14:14:34: I171109 14:14:34.686840 12 server/node.go:405  [n2] node=2: started with [[]=/var/lib/dcos/cockroach] engine(s) and attributes []
2017-11-09 14:14:34: I171109 14:14:34.686952 12 sql/executor.go:338  [n2] creating distSQLPlanner with address {tcp 17.121.4.134:26257}
2017-11-09 14:14:34: I171109 14:14:34.687264 98 gossip/client.go:136  [n2] closing client to node 1 (17.121.4.196:26257): received forward from node 1 to 5 (17.121.4.195:26257)
2017-11-09 14:14:34: I171109 14:14:34.687342 94 gossip/gossip.go:1210  [n2] node has connected to cluster via gossip
2017-11-09 14:14:34: I171109 14:14:34.687531 94 storage/stores.go:312  [n2] wrote 4 node addresses to persistent storage
2017-11-09 14:14:34: I171109 14:14:34.707236 546 gossip/client.go:131  [n2] started gossip client to 17.121.4.195:26257
2017-11-09 14:14:34: I171109 14:14:34.709813 12 server/server.go:735  [n2] starting https server at 127.0.0.1:8090
2017-11-09 14:14:34: I171109 14:14:34.709835 12 server/server.go:736  [n2] starting grpc/postgres server at 17.121.4.134:26257
2017-11-09 14:14:34: I171109 14:14:34.709846 12 server/server.go:737  [n2] advertising CockroachDB node at 17.121.4.134:26257
2017-11-09 14:14:34: I171109 14:14:34.710824 12 server/server.go:847  [n2] done ensuring all necessary migrations have run
2017-11-09 14:14:34: I171109 14:14:34.711342 12 server/server.go:849  [n2] serving sql connections
2017-11-09 14:14:34: I171109 14:14:34.713000 12 cli/start.go:411  node startup completed:
2017-11-09 14:14:34: CockroachDB node starting at 2017-11-09 14:14:34.711565172 -0800 PST
2017-11-09 14:14:34: build:      CCL v1.0.6 @ 2017/10/25 17:35:28 (go1.8.3)
2017-11-09 14:14:34: admin:      https://127.0.0.1:8090
2017-11-09 14:14:34: sql:        postgresql://[email protected]:26257?sslmode=verify-full&sslrootcert=%2Frun%2Fdcos%2Fpki%2Fcockroach%2Fca.crt
2017-11-09 14:14:34: logs:
2017-11-09 14:14:34: store[0]:   path=/var/lib/dcos/cockroach
2017-11-09 14:14:34: status:     restarted pre-existing node
2017-11-09 14:14:34: clusterID:  8c1152d0-511e-4419-9ac2-ff1e15f5a74e
2017-11-09 14:14:34: nodeID:     2
  2. Please describe the issue you observed:
  • What did you do?

A cluster of 5 cockroachdb nodes running v1.0.6 was running correctly.

We restarted one of the servers.

After a few minutes we noticed that our queries slowed down rapidly, and then the cluster became unavailable for ~1h10min.

The cluster healed itself after about an hour and queries started working again.

There is a single database in the cluster and it contains very little data (~1000 records).

There is one client per cockroachdb instance, each of which performs ~1 req/s.

  • What did you expect to see?

The cluster remains available for queries.

  • What did you see instead?

The server was rebooted at 2017-11-09 14:09:31:

2017-11-09 14:09:31: W171109 14:09:31.893605 608 storage/raft_transport.go:442  [n3] raft transport stream to node 2 failed: rpc error: code = Internal desc = transport: write tcp 17.121.4.3:33050->17.121.4.134:26257: use of closed network connection

Queries on other nodes begin slowing down rapidly after ~5 mins:

2017-11-09 14:13:47: 17.115.238.29 [09/Nov/2017:14:13:47 -0800] "GET /snipped/ HTTP/1.0" 200 22 "-" "-" (0.010806 s)
...
2017-11-09 14:14:44: 17.115.238.29 [09/Nov/2017:14:14:44 -0800] "GET /snipped/ HTTP/1.0" 200 22 "-" "-" (0.175371 s)
...

2017-11-09 14:14:46: 17.115.238.29 [09/Nov/2017:14:14:46 -0800] "GET /snipped/ HTTP/1.0" 200 22 "-" "-" (1.249569 s)
...
2017-11-09 14:14:52: 17.115.238.29 [09/Nov/2017:14:14:52 -0800] "GET /snipped/ HTTP/1.0" 200 22 "-" "-" (4.864235 s)
...
2017-11-09 14:15:10: 17.115.238.29 [09/Nov/2017:14:15:10 -0800] "GET /snipped/ HTTP/1.0" 200 22 "-" "-" (14.931100 s)
...
2017-11-09 14:16:06: 17.115.238.29 [09/Nov/2017:14:16:06 -0800] "GET /snipped/ HTTP/1.0" 200 22 "-" "-" (34.449395 s)
...
2017-11-09 14:16:16: 17.121.4.66, 17.121.4.137 [09/Nov/2017:14:16:16 -0800] "GET /snipped/ HTTP/1.0" 200 22 "-" "-" (37.076515 s)

At this point all client queries stall completely; not a single request is served for over an hour.

...
2017-11-09 15:24:37: 17.121.4.196, 17.121.4.69 [09/Nov/2017:15:24:37 -0800] "GET /snipped/ HTTP/1.0" 200 22 "-" "-" (0.036757 s)
...

Then, at 15:24:37, the first request is served again, at normal speed.

During this period the following logs are observed across all 4 remaining cockroachdb instances:

...
2017-11-09 14:14:44: gossip client (1/3 cur/max conns)
2017-11-09 14:14:44: 1: 17.121.4.196:26257 (5m13s: infos 72/2651 sent/received, bytes 26461B/322903B sent/received)
2017-11-09 14:14:44: gossip server (0/3 cur/max conns, infos 11128/6661672 sent/received, bytes 2408122B/667790548B sent/received)
2017-11-09 14:14:44: I171109 14:14:44.466773 502 server/status/runtime.go:225  [n3] runtime stats: 1.3 GiB RSS, 15799 goroutines, 472 MiB/11 MiB/853 MiB GO alloc/idle/total, 140 MiB/317 MiB CGO alloc/total, 1448.52cgo/sec, 1.09/0.23 %(u/s)time, 0.00 %gc (3x)
2017-11-09 14:14:44: I171109 14:14:44.497467 64 storage/replica_raftstorage.go:415  [replicate,n3,s3,r177/1:/System/tsd/cr.node.txn.resta…] generated preemptive snapshot bc0c21ec at index 24
2017-11-09 14:14:44: I171109 14:14:44.497887 64 storage/replicate_queue.go:235  [replicate,n3,s3,r177/1:/System/tsd/cr.node.txn.resta…] snapshot failed: (n2,s2):?: remote declined snapshot: reservation rejected
2017-11-09 14:14:45: I171109 14:14:45.499074 64 storage/replica_raftstorage.go:415  [replicate,n3,s3,r127/1:/System/tsd/cr.node.{sy…-tx…}] generated preemptive snapshot 8e588715 at index 139371
2017-11-09 14:14:54: I171109 14:14:54.466928 502 server/status/runtime.go:225  [n3] runtime stats: 1.5 GiB RSS, 15844 goroutines, 529 MiB/67 MiB/1.0 GiB GO alloc/idle/total, 140 MiB/318 MiB CGO alloc/total, 1313.99cgo/sec, 0.91/0.28 %(u/s)time, 0.00 %gc (2x)
2017-11-09 14:14:54: I171109 14:14:54.751672 64 storage/store.go:3364  [replicate,n3,s3,r127/1:/System/tsd/cr.node.{sy…-tx…}] streamed snapshot to (n2,s2):?: kv pairs: 3502, log entries: 16, rate-limit: 2.0 MiB/sec, 9252ms
2017-11-09 14:14:55: I171109 14:14:55.009121 64 storage/replica_command.go:3575  [replicate,n3,s3,r127/1:/System/tsd/cr.node.{sy…-tx…}] change replicas (ADD_REPLICA (n2,s2):4): read existing descriptor r127:/System/tsd/cr.node.{sys.goroutines/2/10s/2017-11-03T01:00:00Z-txn.aborts/3/10s/2017-11-03T23:00:00Z} [(n3,s3):1, (n4,s4):2, (n1,s1):3, next=4]
2017-11-09 14:15:04: I171109 14:15:04.466160 502 server/status/runtime.go:225  [n3] runtime stats: 1.5 GiB RSS, 15831 goroutines, 663 MiB/1.4 MiB/1.0 GiB GO alloc/idle/total, 142 MiB/319 MiB CGO alloc/total, 487.63cgo/sec, 0.16/0.05 %(u/s)time, 0.00 %gc (0x)
2017-11-09 14:15:14: I171109 14:15:14.466204 502 server/status/runtime.go:225  [n3] runtime stats: 1.5 GiB RSS, 15831 goroutines, 402 MiB/167 MiB/1.0 GiB GO alloc/idle/total, 141 MiB/319 MiB CGO alloc/total, 149.80cgo/sec, 0.31/0.08 %(u/s)time, 0.00 %gc (1x)
2017-11-09 14:15:24: I171109 14:15:24.467615 502 server/status/runtime.go:225  [n3] runtime stats: 1.5 GiB RSS, 15824 goroutines, 468 MiB/155 MiB/1.0 GiB GO alloc/idle/total, 141 MiB/319 MiB CGO alloc/total, 174.58cgo/sec, 0.09/0.03 %(u/s)time, 0.00 %gc (0x)
2017-11-09 14:15:34: I171109 14:15:34.467528 502 server/status/runtime.go:225  [n3] runtime stats: 1.5 GiB RSS, 15826 goroutines, 517 MiB/142 MiB/1.0 GiB GO alloc/idle/total, 142 MiB/320 MiB CGO alloc/total, 152.70cgo/sec, 0.08/0.03 %(u/s)time, 0.00 %gc (0x)
2017-11-09 14:15:43: W171109 14:15:43.223332 30543758 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: b1ad86fa], BeginTransaction [/Table/11/1/52/1/2017-11-08T02:35:00.022268Z/3,/Min), DeleteRange [/Table/11/1/52/1/2017-11-08T02:35:00.022268Z/3,/Table/11/1/52/1/2017-11-08T02:35:00.022268Z/4)
2017-11-09 14:15:43: W171109 14:15:43.229209 30547366 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 2d8e6329], BeginTransaction [/Table/11/1/53/1/2017-11-09T07:29:17.897156Z/3,/Min), DeleteRange [/Table/11/1/53/1/2017-11-09T07:29:17.897156Z/3,/Table/11/1/53/1/2017-11-09T07:29:17.897156Z/4)
2017-11-09 14:15:43: W171109 14:15:43.229340 30547311 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 5edc82d5], BeginTransaction [/Table/11/1/53/1/2017-11-09T03:45:30.892219Z/3,/Min), DeleteRange [/Table/11/1/53/1/2017-11-09T03:45:30.892219Z/3,/Table/11/1/53/1/2017-11-09T03:45:30.892219Z/4)
2017-11-09 14:15:43: W171109 14:15:43.240201 30547400 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 53e11559], BeginTransaction [/Table/11/1/53/1/2017-11-09T09:54:45.187178Z/3,/Min), DeleteRange [/Table/11/1/53/1/2017-11-09T09:54:45.187178Z/3,/Table/11/1/53/1/2017-11-09T09:54:45.187178Z/4)
2017-11-09 14:15:43: W171109 14:15:43.254916 30547962 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: a5378025], BeginTransaction [/Table/11/1/55/12/2017-11-03T23:26:57.827931Z/3,/Min), DeleteRange [/Table/11/1/55/12/2017-11-03T23:26:57.827931Z/3,/Table/11/1/55/12/2017-11-03T23:26:57.827931Z/4)
2017-11-09 14:15:43: W171109 14:15:43.269100 30548310 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: e3f0c7ba], BeginTransaction [/Table/11/1/55/12/2017-11-04T10:04:30.030217Z/3,/Min), DeleteRange [/Table/11/1/55/12/2017-11-04T10:04:30.030217Z/3,/Table/11/1/55/12/2017-11-04T10:04:30.030217Z/4)
2017-11-09 14:15:43: W171109 14:15:43.294618 30548877 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 09aac5dc], BeginTransaction [/Table/11/1/55/12/20
...

Followed by many hundreds more of these BeginTransaction logs until around 14:20:44:

...
7-11-06T13:08:28.977543Z/3,/Min), DeleteRange [/Table/11/1/56/1/2017-11-06T13:08:28.977543Z/3,/Table/11/1/56/1/2017-11-06T13:08:28.977543Z/4)
2017-11-09 14:20:42: W171109 14:20:42.186832 30552148 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 3f8adc82], BeginTransaction [/Table/11/1/55/12/20
17-11-09T13:52:27.45156Z/3,/Min), DeleteRange [/Table/11/1/55/12/2017-11-09T13:52:27.45156Z/3,/Table/11/1/55/12/2017-11-09T13:52:27.45156Z/4)
2017-11-09 14:20:42: W171109 14:20:42.192140 30555163 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 31b0794e], BeginTransaction [/Table/11/1/56/1/201
7-11-07T21:31:36.01715Z/3,/Min), DeleteRange [/Table/11/1/56/1/2017-11-07T21:31:36.01715Z/3,/Table/11/1/56/1/2017-11-07T21:31:36.01715Z/4)
2017-11-09 14:20:42: W171109 14:20:42.488627 30554493 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 9f9d094b], BeginTransaction [/Table/11/1/56/1/201
7-11-07T00:32:02.271335Z/3,/Min), DeleteRange [/Table/11/1/56/1/2017-11-07T00:32:02.271335Z/3,/Table/11/1/56/1/2017-11-07T00:32:02.271335Z/4)
2017-11-09 14:20:42: W171109 14:20:42.576577 30554078 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 0e5f638b], BeginTransaction [/Table/11/1/56/1/201
7-11-06T07:39:44.77767Z/3,/Min), DeleteRange [/Table/11/1/56/1/2017-11-06T07:39:44.77767Z/3,/Table/11/1/56/1/2017-11-06T07:39:44.77767Z/4)
2017-11-09 14:20:42: W171109 14:20:42.926683 30553601 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 70228518], BeginTransaction [/Table/11/1/56/1/201
7-11-05T16:59:58.766956Z/3,/Min), DeleteRange [/Table/11/1/56/1/2017-11-05T16:59:58.766956Z/3,/Table/11/1/56/1/2017-11-05T16:59:58.766956Z/4)
2017-11-09 14:20:43: W171109 14:20:43.300941 30595657 kv/dist_sender.go:1190  [client=17.121.4.3:38462,user=root,n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: acd8b404], Begi
nTransaction [/Table/11/1/55/12/2017-11-09T22:22:40.429934Z/3/0,/Min), ConditionalPut [/Table/11/1/55/12/2017-11-09T22:22:40.429934Z/3/0,/Min)
2017-11-09 14:20:44: W171109 14:20:44.455552 30547366 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 2d8e6329], EndTransaction [/Table/11/1/53/1/2017-
11-09T07:29:17.897156Z/3,/Min)
2017-11-09 14:20:44: I171109 14:20:44.467219 501 gossip/gossip.go:448  [n3] gossip status (ok, 5 nodes)
2017-11-09 14:20:44: gossip client (1/3 cur/max conns)
2017-11-09 14:20:44: 1: 17.121.4.196:26257 (11m13s: infos 91/6619 sent/received, bytes 30831B/728760B sent/received)
2017-11-09 14:20:44: gossip server (0/3 cur/max conns, infos 11147/6665640 sent/received, bytes 2412492B/668196405B sent/received)
2017-11-09 14:20:44: I171109 14:20:44.468086 502 server/status/runtime.go:225  [n3] runtime stats: 1.9 GiB RSS, 15897 goroutines, 764 MiB/139 MiB/1.4 GiB GO alloc/idle/total, 135 MiB/316 MiB
 CGO alloc/total, 747.43cgo/sec, 0.08/0.03 %(u/s)time, 0.00 %gc (0x)
2017-11-09 14:20:44: W171109 14:20:44.482135 30547311 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 5edc82d5], EndTransaction [/Table/11/1/53/1/2017-
11-09T03:45:30.892219Z/3,/Min)
2017-11-09 14:20:44: W171109 14:20:44.482194 30547400 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 53e11559], EndTransaction [/Table/11/1/53/1/2017-
11-09T09:54:45.187178Z/3,/Min)
2017-11-09 14:20:44: W171109 14:20:44.482216 30543758 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: b1ad86fa], EndTransaction [/Table/11/1/52/1/2017-11-08T02:35:00.022268Z/3,/Min)
2017-11-09 14:20:44: W171109 14:20:44.482493 30547962 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: a5378025], EndTransaction [/Table/11/1/55/12/2017-11-03T23:26:57.827931Z/3,/Min)
2017-11-09 14:20:44: W171109 14:20:44.587456 30548310 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: e3f0c7ba], EndTransaction [/Table/11/1/55/12/2017-11-04T10:04:30.030217Z/3,/Min)
2017-11-09 14:20:44: W171109 14:20:44.670217 30548877 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 09aac5dc], EndTransaction [/Table/11/1/55/12/2017-11-04T20:37:46.02012Z/3,/Min)
2017-11-09 14:20:44: W171109 14:20:44.694555 30549052 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 3bff38d8], EndTransaction [/Table/11/1/55/12/2017-11-05T03:11:09.98455Z/3,/Min)
...

By this point the restarted server has booted and its cockroachdb instance has rejoined the cluster (gossip status is now (ok, 5 nodes)), and we start seeing lots of EndTransaction logs:

...
2017-11-09 14:44:55: W171109 14:44:55.786976 30702533 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: f8055627], HeartbeatTxn [/Table/11/1/56/1/2017-11-04T16:38:48.547356Z/3,/Min)
2017-11-09 14:44:55: W171109 14:44:55.827951 30703279 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: b9e14a6c], HeartbeatTxn [/Table/11/1/56/1/2017-11-09T19:39:25.080944Z/3,/Min)
2017-11-09 14:44:55: W171109 14:44:55.827985 30703434 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: c25955fe], HeartbeatTxn [/Table/11/1/53/1/2017-11-08T23:31:00.399443Z/3,/Min)
2017-11-09 14:44:55: W171109 14:44:55.858130 30555148 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: d4583ae2], EndTransaction [/Table/11/1/56/1/2017-11-07T20:29:05.636352Z/3,/Min)
2017-11-09 14:44:55: W171109 14:44:55.858210 30546246 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: b1f0890d], EndTransaction [/Table/11/1/53/1/2017-11-06T13:00:53.926851Z/3,/Min)
...

This leads into another giant batch of logs about failed heartbeats:

...
2017-11-09 14:48:33: W171109 14:48:33.149608 30729047 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: bd9397a2], HeartbeatTxn [/Table/11/1/52/1/2017-11
-04T00:11:10.217563Z/3,/Min)
2017-11-09 14:48:33: W171109 14:48:33.290415 30729050 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 32522fe6], HeartbeatTxn [/Table/11/1/53/1/2017-11
-07T09:48:23.724429Z/3,/Min)
2017-11-09 14:48:33: W171109 14:48:33.370784 30728799 kv/dist_sender.go:1190  [n3] have been waiting 1m0s sending RPC to r6 for batch: [txn: 14430eff], HeartbeatTxn [/Table/11/1/53/1/2017-11
-05T20:49:23.972232Z/3,/Min)
2017-11-09 14:48:34: I171109 14:48:34.466985 502 server/status/runtime.go:225  [n3] runtime stats: 1.9 GiB RSS, 30095 goroutines, 511 MiB/343 MiB/1.5 GiB GO alloc/idle/total, 136 MiB/316 MiB
 CGO alloc/total, 75.20cgo/sec, 0.07/0.05 %(u/s)time, 0.00 %gc (0x)
2017-11-09 14:48:36: W171109 14:48:36.019060 30696663 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=fb077c63 key=/Table/11/1/56/1/2017-11-08T09:27:28.314241Z/3 rw=true pri=0.038
83545 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.258179221,0 orig=1510265680.258179221,0 max=1510265680.758179221,0 wto=false rop=true seq=13 failed: context canceled
2017-11-09 14:48:36: W171109 14:48:36.454448 30674194 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=1c1ad3da key=/Table/11/1/56/1/2017-11-04T21:46:18.190642Z/3 rw=true pri=0.034
11444 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.241679614,0 orig=1510265680.241679614,0 max=1510265680.741679614,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 14:48:36: W171109 14:48:36.487785 30647944 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=305a2b99 key=/Table/11/1/56/1/2017-11-04T07:57:39.202673Z/3 rw=true pri=0.003
38629 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.239055004,0 orig=1510265680.239055004,0 max=1510265680.739055004,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 14:48:36: W171109 14:48:36.761489 30671445 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=5f9c4d54 key=/Table/11/1/55/12/2017-11-04T14:43:53.345749Z/3 rw=true pri=0.02
097283 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.165047503,0 orig=1510265680.165047503,0 max=1510265680.665047503,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 14:48:37: W171109 14:48:37.183674 30660082 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=a4d97c6d key=/Table/11/1/55/12/2017-11-05T20:46:25.781629Z/3 rw=true pri=0.02
819351 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.172254789,0 orig=1510265680.172254789,0 max=1510265680.672254789,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 14:48:37: W171109 14:48:37.368956 30697106 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=f357a651 key=/Table/11/1/53/1/2017-11-07T16:28:50.317093Z/3 rw=true pri=0.011
07552 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.155465298,0 orig=1510265680.155465298,0 max=1510265680.655465298,0 wto=false rop=true seq=13 failed: context canceled
2017-11-09 14:48:37: W171109 14:48:37.440934 30647564 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=d8835c4a key=/Table/11/1/55/12/2017-11-07T19:29:03.660453Z/3 rw=true pri=0.02
064067 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.218079650,0 orig=1510265680.218079650,0 max=1510265680.718079650,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 14:48:37: W171109 14:48:37.565051 30670874 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=d8249354 key=/Table/11/1/55/12/2017-11-06T23:15:10.282753Z/3 rw=true pri=0.00
238298 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.214969204,0 orig=1510265680.214969204,0 max=1510265680.714969204,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 14:48:37: W171109 14:48:37.579991 30670798 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=4200e3ca key=/Table/11/1/56/1/2017-11-06T07:21:14.104388Z/3 rw=true pri=0.016
44180 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.248550489,0 orig=1510265680.248550489,0 max=1510265680.748550489,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 14:48:37: W171109 14:48:37.959982 30697124 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=647b439e key=/Table/11/1/56/1/2017-11-08T20:24:44.62044Z/3 rw=true pri=0.0638
...

And then it all goes back to normal by the last line below:

...
2017-11-09 15:24:03: W171109 15:24:03.398937 30769512 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=0dde013c key=/Table/11/1/53/1/2017-11-06T20:36:16.057396Z/3 rw=true pri=0.009
80929 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.153940297,0 orig=1510265680.153940297,0 max=1510265680.653940297,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 15:24:03: W171109 15:24:03.509478 30756429 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=9131b3ad key=/Table/11/1/52/1/2017-11-05T21:09:59.207196Z/3 rw=true pri=0.010
04897 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.134422146,0 orig=1510265680.134422146,0 max=1510265680.634422146,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 15:24:03: W171109 15:24:03.548690 30764369 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=f9e36ea8 key=/Table/11/1/53/1/2017-11-04T08:52:04.362469Z/3 rw=true pri=0.007
77552 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.146961687,0 orig=1510265680.146961687,0 max=1510265680.646961687,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 15:24:04: I171109 15:24:04.467578 502 server/status/runtime.go:225  [n3] runtime stats: 1.1 GiB RSS, 483 goroutines, 216 MiB/309 MiB/684 MiB GO alloc/idle/total, 138 MiB/315 MiB C
GO alloc/total, 1036.38cgo/sec, 0.16/0.02 %(u/s)time, 0.00 %gc (1x)
2017-11-09 15:24:05: W171109 15:24:05.149119 30756833 kv/txn_coord_sender.go:817  [n3] heartbeat to "unnamed" id=e14870c1 key=/Table/11/1/53/1/2017-11-04T00:58:44.297663Z/3 rw=true pri=0.019
89640 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510265680.146892124,0 orig=1510265680.146892124,0 max=1510265680.646892124,0 wto=false rop=false seq=13 failed: context canceled
2017-11-09 15:24:13: W171109 15:24:13.107601 30813126 kv/txn_coord_sender.go:817  [client=17.121.4.3:38462,user=root,n3] heartbeat to "unnamed" id=d4f0aa82 key=/Table/11/1/55/12/2017-11-09T2
3:28:51.872361Z/3/0 rw=true pri=0.00296137 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510269810.398288667,0 orig=1510269810.398288667,0 max=1510269810.898288667,0 wto=false rop=false seq=7 fail
ed: context canceled
2017-11-09 15:24:14: I171109 15:24:14.466557 502 server/status/runtime.go:225  [n3] runtime stats: 1.1 GiB RSS, 477 goroutines, 170 MiB/363 MiB/689 MiB GO alloc/idle/total, 140 MiB/317 MiB C
GO alloc/total, 1104.11cgo/sec, 0.18/0.04 %(u/s)time, 0.00 %gc (1x)
2017-11-09 15:24:24: I171109 15:24:24.467379 502 server/status/runtime.go:225  [n3] runtime stats: 1.1 GiB RSS, 479 goroutines, 192 MiB/346 MiB/691 MiB GO alloc/idle/total, 137 MiB/320 MiB C
GO alloc/total, 2468.64cgo/sec, 0.22/0.03 %(u/s)time, 0.00 %gc (1x)
2017-11-09 15:24:29: W171109 15:24:29.449864 30815432 kv/txn_coord_sender.go:817  [client=17.121.4.3:38122,user=root,n3] heartbeat to "unnamed" id=e82814b9 key=/Table/11/1/56/1/2017-11-09T23
:28:03.796885Z/3/0 rw=true pri=0.03006533 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510269854.366335562,0 orig=1510269854.366335562,0 max=1510269854.866335562,0 wto=false rop=false seq=7 faile
d: context canceled
2017-11-09 15:24:34: I171109 15:24:34.466901 502 server/status/runtime.go:225  [n3] runtime stats: 1.2 GiB RSS, 477 goroutines, 209 MiB/360 MiB/702 MiB GO alloc/idle/total, 139 MiB/320 MiB C
GO alloc/total, 2667.27cgo/sec, 0.48/0.10 %(u/s)time, 0.00 %gc (2x)
2017-11-09 15:24:42: W171109 15:24:42.548786 30824220 kv/txn_coord_sender.go:817  [client=17.121.4.3:38566,user=root,n3] heartbeat to "unnamed" id=99f476d4 key=/Table/11/1/54/1/2017-11-09T23
:28:49.245068Z/3/0 rw=true pri=0.00671376 iso=SERIALIZABLE stat=PENDING epo=0 ts=1510269870.075877039,0 orig=1510269870.075877039,0 max=1510269870.575877039,0 wto=false rop=false seq=7 faile
d: context canceled
2017-11-09 15:24:42: I171109 15:24:42.561166 30825670 sql/mon/mem_usage.go:520  [n3,HashJoiner] distsql: memory usage increases to 20 KiB (+10240)
2017-11-09 15:24:42: I171109 15:24:42.561257 30825691 sql/mon/mem_usage.go:520  [n3,HashJoiner] distsql: memory usage increases to 40 KiB (+10240)
2017-11-09 15:24:42: I171109 15:24:42.563878 30825676 sql/mon/mem_usage.go:520  [n3,HashJoiner] distsql: memory usage increases to 20 KiB (+10240)
2017-11-09 15:24:44: I171109 15:24:44.463715 501 gossip/gossip.go:448  [n3] gossip status (ok, 5 nodes)
2017-11-09 15:24:44: gossip client (1/3 cur/max conns)
2017-11-09 15:24:44: 1: 17.121.4.196:26257 (1h15m13s: infos 158/51831 sent/received, bytes 45957B/5180649B sent/received)
2017-11-09 15:24:44: gossip server (0/3 cur/max conns, infos 11214/6710852 sent/received, bytes 2427618B/672648294B sent/received)
2017-11-09 15:24:44: I171109 15:24:44.468729 502 server/status/runtime.go:225  [n3] runtime stats: 1.1 GiB RSS, 471 goroutines, 230 MiB/312 MiB/685 MiB GO alloc/idle/total, 136 MiB/317 MiB C
GO alloc/total, 2950.66cgo/sec, 0.41/0.07 %(u/s)time, 0.00 %gc (2x)
2017-11-09 15:24:54: I171109 15:24:54.466482 502 server/status/runtime.go:225  [n3] runtime stats: 1.1 GiB RSS, 471 goroutines, 264 MiB/278 MiB/686 MiB GO alloc/idle/total, 137 MiB/317 MiB C
GO alloc/total, 958.82cgo/sec, 0.22/0.03 %(u/s)time, 0.00 %gc (1x)
2017-11-09 15:25:04: I171109 15:25:04.466070 502 server/status/runtime.go:225  [n3] runtime stats: 1.1 GiB RSS, 471 goroutines, 158 MiB/375 MiB/687 MiB GO alloc/idle/total, 138 MiB/318 MiB CGO alloc/total, 1000.24cgo/sec, 0.29/0.04 %(u/s)time, 0.00 %gc (2x)
2017-11-09 15:25:14: I171109 15:25:14.467319 502 server/status/runtime.go:225  [n3] runtime stats: 1.1 GiB RSS, 471 goroutines, 161 MiB/369 MiB/688 MiB GO alloc/idle/total, 139 MiB/318 MiB C
...
@a-robinson
Contributor

Thanks for sharing, @gpaul. Would you be willing to share the full log files with me? If you don't want to share them publicly, you can always send them to me at [email protected]. This isn't ringing any bells for me on first glance, and being able to piece more of the info together should help.

I assume this is the only time you've hit this problem?

@gpaul
Author

gpaul commented Nov 20, 2017

Hi Alex, I believe the same customer ran into this situation twice. Neither us nor them have been able to reproduce it since. I'll email you the full logs. Thanks for taking a look at this.

@a-robinson
Contributor

Thanks for sending the logs. I haven't made any conclusions yet, but am going to jot down some thoughts as I go.

  • Mesos took more than 5 minutes to bring the restarted node back up, which made me suspicious that this is related to stability: stuck requests after rebalancing-inducing downtime #19165. However, it looks like timeseries writes started stalling as soon as the process went down, not after it came back up, which doesn't match the same symptoms. That either means it's not the same problem or there are two problems intertwined here.
  • The node that got restarted had very few stuck requests on it after it restarted other than timeseries writes. This also points toward stability: stuck requests after rebalancing-inducing downtime #19165, unless it's just that way fewer requests were sent to the restarted node.
  • The node that got restarted finished applying snapshots to catch up just minutes after it restarted, and it doesn't appear to have had a long queue of replicas it needed to GC, making stability: stuck requests after rebalancing-inducing downtime #19165 less likely.
  • The pattern you refer to of queries gradually getting slower and slower is a very uncommon symptom in cockroach. It'd be much less surprising if it just immediately started hanging for more and more requests.
  • The 100MiB cache size is much smaller than we normally test with. I expect it's fine, but maybe it doesn't work well with ranges that manage to get above 100MiB in size? This is probably nothing.
  • One thing that's quite weird and seems worth noting is that two of the nodes' raft transports never reconnected to the restarted node after it came back up. I don't know what that implies:
$ grep "raft transport stream to node" */*
17.121.4.134_master/dcos-cockroach.service:2017-11-09 14:14:34: I171109 14:14:34.927390 934 storage/raft_transport.go:436  [n2] raft transport stream to node 5 established
17.121.4.134_master/dcos-cockroach.service:2017-11-09 14:14:36: I171109 14:14:36.486272 895 storage/raft_transport.go:436  [n2] raft transport stream to node 3 established
17.121.4.134_master/dcos-cockroach.service:2017-11-09 14:14:39: I171109 14:14:39.501746 1012 storage/raft_transport.go:436  [n2] raft transport stream to node 1 established
17.121.4.134_master/dcos-cockroach.service:2017-11-09 14:14:39: I171109 14:14:39.501777 1011 storage/raft_transport.go:436  [n2] raft transport stream to node 4 established
17.121.4.195_master/dcos-cockroach.service:2017-11-09 14:09:31: W171109 14:09:31.386146 691 storage/raft_transport.go:442  [n5] raft transport stream to node 2 failed: rpc error: code = Internal desc = transport is closing
17.121.4.195_master/dcos-cockroach.service:2017-11-09 14:14:35: I171109 14:14:35.286096 19384450 storage/raft_transport.go:436  [n5] raft transport stream to node 2 established
17.121.4.196_master/dcos-cockroach.service:2017-11-09 14:09:32: W171109 14:09:32.641521 674 storage/raft_transport.go:442  [n1] raft transport stream to node 2 failed: rpc error: code = Internal desc = transport is closing
17.121.4.3_master/dcos-cockroach.service:2017-11-09 14:09:31: W171109 14:09:31.893605 608 storage/raft_transport.go:442  [n3] raft transport stream to node 2 failed: rpc error: code = Internal desc = transport: write tcp 17.121.4.3:33050->17.121.4.134:26257: use of closed network connection
17.121.4.3_master/dcos-cockroach.service:2017-11-09 14:14:36: I171109 14:14:36.762984 30541008 storage/raft_transport.go:436  [n3] raft transport stream to node 2 established
17.121.4.66_master/dcos-cockroach.service:2017-11-09 14:09:32: W171109 14:09:32.435963 721 storage/raft_transport.go:442  [n4] raft transport stream to node 2 failed: rpc error: code = Internal desc = transport is closing

I have a few questions for you that might help narrow things down:

  1. Do you know whether you were sending as many requests to the node that restarted as to all the other nodes?
  2. Did all requests to the database hang for the entire 1+ hour period, or just some of them?
  3. Do you know whether anything changed externally to cockroach when the queries started succeeding again? A client connection closing or anything like that?
  4. Approximately how much data is on each node?
  5. Do you have access to the admin UI graphs or the exported metrics anywhere? I'm curious if there were any "Under-replicated" ranges just before the node was restarted. The "Ranges" graph on the "REPLICATION" dashboard in the admin UI would have that stat.
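If the admin UI isn't reachable, that stat can probably also be scraped straight off a node. A rough sketch, assuming this build exposes the Prometheus-style /_status/vars endpoint on the admin HTTP port (bound to 127.0.0.1:8090 in your config), that it answers without a client certificate, and that the under-replicated ranges metric has "underreplicated" in its name (all assumptions about v1.0.x):

$ curl -sk https://127.0.0.1:8090/_status/vars | grep -i underreplicated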

@gpaul
Author

gpaul commented Nov 22, 2017

Thanks for the deep dive @a-robinson. I'll try and get the requested information.

@gpaul
Author

gpaul commented Nov 22, 2017

Do you know whether you were sending as many requests to the node that restarted as to all the other nodes?

The reboot nuked the logs from the server we restarted. I suspect it may have seen a somewhat higher request rate given that it was the mesos leader, but not much.

Did all requests to the database hang for the entire 1+ hour period, or just some of them?

All queries hung.

Do you know whether anything changed externally to cockroach when the queries started succeeding again? A client connection closing or anything like that?

It is very unlikely that any connection changed - there is a single client per cockroachdb instance and nothing external had any effect.

Approximately how much data is on each node?

There is very little data in the database. On the order of ~100-1,000 records in total.

Do you have access to the admin UI graphs or the exported metrics anywhere? I'm curious if there were any "Under-replicated" ranges just before the node was restarted. The "Ranges" graph on the "REPLICATION" dashboard in the admin UI would have that stat.

We may be able to get such metrics next week once everyone is back from the holidays.

@gpaul
Author

gpaul commented Nov 22, 2017

It really looks a lot like #19165 - could you provide me with the steps to reproduce that issue? I'm going to need to provide root cause analysis of this issue and a likely culprit combined with a reproduction thereof would go a long way.

@a-robinson
Contributor

The reboot nuked the logs from the server we restarted. I suspect it may have seen a somewhat higher request rate given that it was the mesos leader, but not much.

I was mostly curious about after the restart - was it getting a similar amount of requests? One of the main symptoms of #19165 was that the restarted node could serve requests but all of the others would hang.

It really looks a lot like #19165 - could you provide me with the steps to reproduce that issue? I'm going to need to provide root cause analysis of this issue and a likely culprit combined with a reproduction thereof would go a long way.

I'm still not convinced that it's the same problem - getting more metrics from before/during/after the incident would be really helpful. But if you want to reproduce that, the steps are:

  1. Create a cluster of at least 4 nodes with clients sending requests to each node. In the original experiment the load was likely the kv load generator being run on each VM, connected to the local cockroach process, with the flags --concurrency 64 --tolerate-errors --min-block-bytes=8 --max-block-bytes=128 --read-percent=50.
  2. Stop a node
  3. Wait 10 minutes
  4. Restart the node
  5. Wait 10 minutes
  6. Repeat steps 2-5 until you see queries hang on all non-restarted nodes (a rough loop for steps 2-5 is sketched below)
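For reference, a minimal sketch of the stop/wait/restart loop in steps 2-5, assuming each node runs CockroachDB under a systemd unit named dcos-cockroach.service as suggested by the log paths earlier in the thread (the unit name and the use of systemd are assumptions about the deployment):

$ while true; do
    sudo systemctl stop dcos-cockroach.service    # step 2: stop the node
    sleep 600                                     # step 3: wait 10 minutes
    sudo systemctl start dcos-cockroach.service   # step 4: restart the node
    sleep 600                                     # step 5: wait 10 minutes
  done

Keep the kv load generator running against every node for the whole time and watch for queries hanging on the non-restarted nodes.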

@a-robinson
Contributor

cc @m-schneider @tschottdorf in case they have any more to say based on their looking into #19165.

@gpaul
Author

gpaul commented Nov 23, 2017

I was mostly curious about after the restart - was it getting a similar amount of requests? One of the main symptoms of #19165 was that the restarted node could serve requests but all of the other would hang.

OK, those logs I have. The restarted instance could definitely not serve any requests while the others were unavailable.

@petermattis
Collaborator

I did some spelunking of the logs. During the time period the node was down (14:09:31 to 14:14:34) I'm seeing a modest amount of rebalance activity (59 applied preemptive snapshot messages). This is to be expected and it mostly comes during the tail of that down time period. After the node comes back up, we rebalance 22 replicas back on to it. This too seems fine and doesn't explain anything. One bit to note is that these were all timeseries ranges. Not that it matters much, but it does highlight that the majority of the data in the cluster is likely internal timeseries data.

After n5 restarted, there were only a handful of replicas removed by replicaGC. This is curious. We seem to have rebalanced a lot of ranges away from n5, which should have resulted in more removals. But perhaps the rebalances during n5's down period were for ranges not on n5. That would be very coincidental. Apologies for the speculation. The combination of rebalance and removals does seem similar to #19165.

Like @a-robinson, I noticed that n1 and n4 did not open up Raft transport connections to n5 after it came back up. This is a bit unexpected. Either it is related to the issue, or n1 and n4 were not the leader for any range for which n5 also had a replica.
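For reference, a rough way to reproduce counts like the 59 above from the per-node logs, assuming the directory-per-node layout used in the earlier grep and that the message text contains the phrase "applied preemptive snapshot" as quoted:

$ grep -c 'applied preemptive snapshot' */dcos-cockroach.service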

@petermattis
Collaborator

@gpaul provided logs and a debug zip from another reproduction. The cluster has only 6 user tables and each of those tables has only one range. The range IDs are r23-r28.

Similar to the previous incident, a node was taken down and restarted after ~5m. At 08:33:10, the other nodes in the cluster saw n3 go down. At 08:38:18 n3 restarted.

Rebalancing of replicas away from n3 started at 08:37:21. That's 4m11s after n3 went down, but I think that is explainable as the cluster is lightly loaded so the most recent periodic gossip from n3 might have been somewhat before it died.

Very shortly after n3 restarts, the client logs show that requests start taking a very long time before freezing completely. CPU profiles and logs show a lot of range lease activity. Some of this is to be expected because we're rebalancing ranges and leases onto n3. What is unexpected is a mass of slow node liveness heartbeats. These occur both when n3 goes down and when it comes back up.

Of interest to the user are the ranges corresponding to their tables. The leases for r23, r27 and r28 are acquired by n3 very quickly after it restarts. r27 has the most user data so I'm going to focus on it below:

I171130 08:38:33.612628 218 storage/replica_proposal.go:415  [n3,s3,r27/3:/Table/5{5-6}] new range lease repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059913.460689822,0 following repl=(n4,s4):2 start=1510699521.466573404,0 epo=11 pro=1510699521.466576036,0
W171130 08:39:11.643200 7764 storage/node_liveness.go:291  [n3,s3,r27/3:/Table/5{5-6}] slow heartbeat took 11.0s
E171130 08:39:11.643228 7764 storage/replica_range_lease.go:192  [n3,s3,r27/3:/Table/5{5-6}] heartbeat failed on epoch increment
I171130 08:39:11.957645 213 storage/replica_proposal.go:415  [n3,s3,r27/3:/Table/5{5-6}] new range lease repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059951.643427231,0 following repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059913.460689822,0
I171130 08:39:21.131541 288 storage/replica_proposal.go:415  [n3,s3,r27/3:/Table/5{5-6}] new range lease repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059960.887661218,0 following repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059951.643427231,0
I171130 08:39:21.491269 335 storage/replica_proposal.go:415  [n3,s3,r27/3:/Table/5{5-6}] new range lease repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059961.131782919,0 following repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059960.887661218,0
W171130 08:39:35.262271 8208 storage/node_liveness.go:291  [n3,s3,r27/3:/Table/5{5-6}] slow heartbeat took 13.8s
E171130 08:39:35.262296 8208 storage/replica_range_lease.go:192  [n3,s3,r27/3:/Table/5{5-6}] heartbeat failed on epoch increment
W171130 08:39:37.103342 8273 storage/node_liveness.go:291  [n3,s3,r27/3:/Table/5{5-6}] slow heartbeat took 1.8s
I171130 08:39:37.203680 331 storage/replica_proposal.go:415  [n3,s3,r27/3:/Table/5{5-6}] new range lease repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059975.262510448,0 following repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059961.131782919,0
W171130 08:40:06.921338 9200 storage/node_liveness.go:291  [n3,s3,r27/3:/Table/5{5-6}] slow heartbeat took 3.8s
E171130 08:40:06.921354 9200 storage/replica_range_lease.go:192  [n3,s3,r27/3:/Table/5{5-6}] heartbeat failed on epoch increment
I171130 08:40:07.867422 464 storage/replica_proposal.go:415  [n3,s3,r27/3:/Table/5{5-6}] new range lease repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512060006.921938815,0 following repl=(n3,s3):3 start=1512059913.460686742,0 epo=11 pro=1512059975.262510448,0

The logs from all nodes indicate lots of slow node liveness heartbeats until 09:28. There are then a smattering of slow node liveness heartbeats until 10:19 at which point the cluster recovers.
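For anyone re-tracing this, a timeline of those slow heartbeats can be pulled out with a sketch like the following, assuming the per-node log layout used earlier in the thread:

$ grep -h 'slow heartbeat took' */dcos-cockroach.service | sort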

Interestingly, the last successful request in the user's logs (from their application) is at 08:39:37. That time seems to correspond with n3 reacquiring the lease for r27. That successful request was very slow (25.2s) which lines up with this message on n4:

E171130 08:39:11.817504 360477877 storage/replica_range_lease.go:197 [n4,s4,r27/2:/Table/5{5-6}] mismatch incrementing epoch for node_id:3 epoch:11 expiration:<wall_time:1512059940426508201 logical:0 > ; actual is {NodeID:3 Epoch:11 Expiration:1512059944.818724691,0 Draining:false}

It is silly that we're printing out the liveness record in 2 different formats, but what that is saying is that n4 failed to increment the liveness record for n3 due to the record having already been updated (the timestamps correspond to 08:39:00 and 08:39:04).

Node liveness seems to be near the root of the issue. But why is node liveness having so many problems? Stepping away to grab some eats. Will continue debugging this.

@a-robinson
Contributor

Interestingly, the last successful request in the user's logs (from their application) is at 08:39:37. That time seems to correspond with n3 reacquiring the lease for r27. That successful request was very slow (25.2s) which lines up with this message on n4:

It looks like node 1's application was the only one able to keep making progress until 08:39:37; two stopped at 08:38:01 and the other (that we have logs for) stopped at 08:39:21. Node 1 is also the only node where I see a slowdown before the stall; all others went from serving successful low-latency responses to hanging with no slowdown. They did all serve some very slow requests while n3 was down before it restarted, though (grep for "[1-9][0-9]\.[0-9]* s" in the client logs).
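Spelled out as a command against the access-log format shown earlier (e.g. "(14.931100 s)"); the filename is a placeholder:

$ grep -E '\([1-9][0-9]\.[0-9]+ s\)' client-access.log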

@petermattis
Collaborator

They did all serve some very slow requests while n3 was down before it restarted, though (grep for "[1-9][0-9].[0-9]* s" in the client logs).

The slow requests around 08:33:31 correspond to when n3 went down. Those can be explained by n3 holding the node liveness range and needing to wait for the expiration-based lease for that range to time out (9s).

@a-robinson
Contributor

Yeah, although it's weird they're 14s-18s rather than, say, 10s.

@petermattis
Collaborator

Looking at stack traces from m-005, I see the bulk of goroutines in:

4831: select [0~3 minutes] [Created by stop.(*Stopper).RunAsyncTask @ stopper.go:295]
    kv         dist_sender.go:1177        (*DistSender).sendToReplicas(#1375, #232905, *, #120, 0, #1376, 0x6, *, 0x3, 0x3, ...)
    kv         dist_sender.go:405         (*DistSender).sendRPC(#1375, #232905, *, 0x6, *, 0x3, 0x3, 0, 0, 0, ...)
    kv         dist_sender.go:474         (*DistSender).sendSingleRange(#1375, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    kv         dist_sender.go:945         (*DistSender).sendPartialBatch(#1375, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    kv         dist_sender.go:815         (*DistSender).divideAndSendBatchToRanges(#1375, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    kv         dist_sender.go:629         (*DistSender).Send(#1375, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    kv         txn_coord_sender.go:478    (*TxnCoordSender).Send(#1515, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    client     db.go:528                  (*DB).send(*, #232905, *, 0, 0, 0, 0, 0, 0, *, ...)
    client     txn.go:865                 (*Txn).send(*, #232905, *, 0, 0, 0, 0, 0, 0, 0, ...)
    client     txn.go:435                 send)-fm(#232905, *, 0, 0, 0, 0, 0, 0, 0, 0, ...)
    client     db.go:436                  sendAndFill(#232905, *, *, *, 0, 0)
    client     txn.go:435                 (*Txn).Run(*, #232905, *, *, 0, 0)
    sql        tablewriter.go:544         (*tableDeleter).finalize(*, #232905, *, #19, *)
    sql        tablewriter.go:581         (*tableDeleter).fastDelete(*, #232905, *, *, 0, *, #232905)
    sql        delete.go:210              (*deleteNode).fastDelete(*, #232905, *, *, *, 0x1)
    sql        delete.go:126              (*deleteNode).Start(*, #232905, *, #82, *)
    sql        plan.go:242                (*planner).startPlan(*, #232905, *, #82, *, *, 0x4)
    sql        planner.go:256             (*planner).exec(*, #232905, *, #34, 0x57, *, 0x4, 0x4, 0, 0, ...)
    sql        lease.go:245               LeaseStore.Release.func1(#232905, *, *, 0, 0)
    client     db.go:490                  (*DB).Txn.func1(#232905, *, *, *, *, *)
    client     txn.go:641                 (*Txn).Exec(*, #232905, *, *, *, *, *)
    client     db.go:491                  (*DB).Txn(*, #232905, *, *, 0, 0)
    sql        lease.go:255               LeaseStore.Release(#56, #1515, #1242, #234538, #1242, #4, 0, 0, 0, 0, ...)
    sql        lease.go:795               (*tableState).removeLease.func1(#232902, #134)
    stop       stopper.go:294             (*Stopper).RunAsyncTask.func1(#1492, #232902, #134, #30, 0x3, 0x1, 0, 0, *)

Another large chunk are in a related piece of code:

4126: select [3~4 minutes] [Created by stop.(*Stopper).RunAsyncTask @ stopper.go:295]
    kv         dist_sender.go:1177        (*DistSender).sendToReplicas(#1375, #232905, *, #120, 0, #1376, 0x6, *, 0x3, 0x3, ...)
    kv         dist_sender.go:405         (*DistSender).sendRPC(#1375, #232905, *, 0x6, *, 0x3, 0x3, 0, 0, 0, ...)
    kv         dist_sender.go:474         (*DistSender).sendSingleRange(#1375, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    kv         dist_sender.go:945         (*DistSender).sendPartialBatch(#1375, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    kv         dist_sender.go:815         (*DistSender).divideAndSendBatchToRanges(#1375, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    kv         dist_sender.go:629         (*DistSender).Send(#1375, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    kv         txn_coord_sender.go:478    (*TxnCoordSender).Send(#1515, #232904, *, 0, 0, 0, 0, 0, 0, *, ...)
    client     db.go:528                  (*DB).send(*, #232905, *, 0, 0, 0, 0, 0, 0, *, ...)
    client     txn.go:865                 (*Txn).send(*, #232905, *, 0, 0, 0, 0, 0, 0, 0, ...)
    client     txn.go:546                 (*Txn).sendEndTxnReq(*, #232905, *, 0x56d301, 0, *)
    client     txn.go:439                 (*Txn).commit(*, #232905, *, *, *)
    client     txn.go:471                 (*Txn).Commit(*, #232905, *, *, 0)
    client     txn.go:656                 (*Txn).Exec(*, #232905, *, *, *, *, *)
    client     db.go:491                  (*DB).Txn(*, #232905, *, *, 0, 0)
    sql        lease.go:255               LeaseStore.Release(#56, #1515, #1242, #234538, #1242, #4, 0, 0, 0, 0, ...)
    sql        lease.go:795               (*tableState).removeLease.func1(#232902, #134)
    stop       stopper.go:294             (*Stopper).RunAsyncTask.func1(#1492, #232902, #134, #30, 0x3, 0x1, 0, 0, *)

m-003 and m-004 show similar. m-001 has relatively few goroutines.

m-002 shows different stacks:

2135: select [0~23 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:498]
    storage        replica.go:1740            (*Replica).beginCmds(#2543, #212523, *, *, *, 0x1, 0x8, #2552)
    storage        replica.go:2188            (*Replica).tryExecuteWriteBatch(#2543, #212523, *, *, 0, #240, 0x3, 0x6, 0, *, ...)
    storage        replica.go:2097            (*Replica).executeWriteBatch(#2543, #212523, *, *, 0, #240, 0x3, 0x6, 0, *, ...)
    storage        replica.go:1431            (*Replica).Send(#2543, #212523, *, *, 0, #240, 0x3, 0x6, 0, *, ...)
    storage        store.go:2555              (*Store).Send(#2162, #212523, *, *, 0, #240, 0x3, 0x6, 0, *, ...)
    storage        stores.go:187              (*Stores).Send(#2216, #212523, *, 0, 0, #240, 0x3, 0x6, 0, *, ...)
    server         node.go:843                (*Node).batchInternal.func1(#212523, *, 0, 0)
    stop           stopper.go:272             (*Stopper).RunTaskWithErr(#2322, #212523, *, *, 0, 0)
    server         node.go:854                (*Node).batchInternal(#1970, #212523, *, *, *, *, 0x550dc3)
    server         node.go:871                (*Node).Batch(#1970, #212523, *, *, #1970, *, *)
    roachpb        api.pb.go:1878             _Internal_Batch_Handler(#34, #1970, #212523, *, *, 0, 0, 0, *, *)
    grpc           server.go:738              (*Server).processUnaryRPC(#2341, #100, *, *, #2022, #68, 0, 0, 0)
    grpc           server.go:932              (*Server).handleStream(#2341, #100, *, *, 0)
    grpc           server.go:497              (*Server).serveStreams.func1.1(*, #2341, #100, *, *)
6379: select [21~22 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:498]
    storage        replica.go:1078            (*Replica).redirectOnOrAcquireLease.func2(*, *, #2543, #212523, *, *, 0)
    storage        replica.go:1128            (*Replica).redirectOnOrAcquireLease(#2543, #212523, *, 0, 0, 0, 0, 0, 0)
    storage        replica.go:2020            (*Replica).executeReadOnlyBatch(#2543, #212523, *, *, 0, #240, 0x3, 0x6, 0, *, ...)
    storage        replica.go:1434            (*Replica).Send(#2543, #212523, *, *, 0, #240, 0x3, 0x6, 0, *, ...)
    storage        store.go:2555              (*Store).Send(#2162, #212523, *, *, 0, #240, 0x3, 0x6, 0, *, ...)
    storage        stores.go:187              (*Stores).Send(#2216, #212523, *, 0, 0, #240, 0x3, 0x6, 0, *, ...)
    server         node.go:843                (*Node).batchInternal.func1(#212523, *, 0, 0)
    stop           stopper.go:272             (*Stopper).RunTaskWithErr(#2322, #212523, *, *, 0, 0)
    server         node.go:854                (*Node).batchInternal(#1970, #212523, *, *, *, *, 0x550dc3)
    server         node.go:871                (*Node).Batch(#1970, #212523, *, *, #1970, *, *)
    roachpb        api.pb.go:1878             _Internal_Batch_Handler(#34, #1970, #212523, *, *, 0, 0, 0, *, *)
    grpc           server.go:738              (*Server).processUnaryRPC(#2341, #100, *, *, #2022, #68, 0, 0, 0)
    grpc           server.go:932              (*Server).handleStream(#2341, #100, *, *, 0)
    grpc           server.go:497              (*Server).serveStreams.func1.1(*, #2341, #100, *, *)
2677: select [0~10 minutes] [Created by grpc.(*Server).serveStreams.func1 @ server.go:498]
    storage        replica.go:2259            (*Replica).tryExecuteWriteBatch(*, #212523, *, *, 0, #240, *, *, 0, *, ...)
    storage        replica.go:2097            (*Replica).executeWriteBatch(*, #212523, *, *, 0, #240, *, *, 0, *, ...)
    storage        replica.go:1431            (*Replica).Send(*, #212523, *, *, 0, #240, *, *, 0, *, ...)
    storage        store.go:2555              (*Store).Send(#2162, #212523, *, *, 0, #240, *, *, 0, *, ...)
    storage        stores.go:187              (*Stores).Send(#2216, #212523, *, 0, 0, #240, *, *, 0, *, ...)
    server         node.go:843                (*Node).batchInternal.func1(#212523, *, 0, 0)
    stop           stopper.go:272             (*Stopper).RunTaskWithErr(#2322, #212523, *, *, 0, 0)
    server         node.go:854                (*Node).batchInternal(#1970, #212523, *, *, *, *, 0x550dc3)
    server         node.go:871                (*Node).Batch(#1970, #212523, *, *, #1970, *, *)
    roachpb        api.pb.go:1878             _Internal_Batch_Handler(#34, #1970, #212523, *, *, 0, 0, 0, *, *)
    grpc           server.go:738              (*Server).processUnaryRPC(#2341, #100, *, *, #2022, #68, 0, 0, 0)
    grpc           server.go:932              (*Server).handleStream(#2341, #100, *, *, 0)
    grpc           server.go:497              (*Server).serveStreams.func1.1(*, #2341, #100, *, *)
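
(For anyone following along: goroutine dumps like these can be captured from a running node via the standard Go pprof handlers, assuming this build serves /debug/pprof on the admin HTTP port and it answers without a client certificate; they are also bundled in the debug zip.)

$ curl -k 'https://127.0.0.1:8090/debug/pprof/goroutine?debug=2' > stacks.txt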

@petermattis
Collaborator

The mass of goroutines with stacks indicating they are trying to acquire or release table leases points to the lease table having a problem. The lease table is a single range (r6). A bit of grep'ing in the logs reveals this range is having lots of problems. There are a ton of messages about slow/stuck operations.
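A sketch of that kind of grep, assuming the per-node log layout from earlier in the thread (the message texts are the "have been waiting 1m0s sending RPC to r6" and "slow heartbeat" warnings quoted above):

$ grep -E 'r6[^0-9]' */dcos-cockroach.service | grep -E 'have been waiting|slow heartbeat'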

@petermattis
Collaborator

Grep'ing for r6 and snapshot reveals:

I171130 10:19:34.852743 123 storage/replica_raftstorage.go:415  [raftsnapshot,n4,s4,r6/3:/Table/1{1-2}] generated Raft snapshot 007a053d at index 1427305
I171130 10:19:35.647544 123 storage/store.go:3364  [raftsnapshot,n4,s4,r6/3:/Table/1{1-2}] streamed snapshot to (n1,s1):6: kv pairs: 159394, log entries: 4173, rate-limit: 8.0 MiB/sec, 795ms
I171130 10:19:35.985607 136 storage/replica_raftstorage.go:415  [raftsnapshot,n3,s3,r6/7:/Table/1{1-2}] generated Raft snapshot 595f122f at index 1427306
I171130 10:19:36.756418 136 storage/store.go:3364  [raftsnapshot,n3,s3,r6/7:/Table/1{1-2}] streamed snapshot to (n1,s1):6: kv pairs: 159392, log entries: 4174, rate-limit: 8.0 MiB/sec, 771ms
I171130 10:19:36.768862 156908425 storage/replica_raftstorage.go:595  [n1,s1,r6/6:/Table/1{1-2}] applying Raft snapshot at index 1427306 (id=595f122f, encoded size=22940379, 22 rocksdb batches, 4174 log entries)
I171130 10:19:38.144947 156908425 storage/replica_raftstorage.go:603  [n1,s1,r6/6:/Table/1{1-2}] applied Raft snapshot in 1376ms [clear=446ms batch=15ms entries=46ms commit=863ms]

Strange that both n3 and n4 generated a Raft snapshot for n1 at almost the same time. Perhaps the leadership changed hands then. And it is very interesting that r6 is the only range which generated a Raft snapshot (grep'ing for raftsnapshot reveals no other ranges besides r6).

The timing of the Raft snapshot is also right around when the cluster recovered.

@a-robinson
Contributor

Almost exactly an hour and a half after n3 came back up, n1 got partially "unstuck", kicking off tons of activity related to r6:

2017-11-30 10:08:24: W171130 10:08:24.321301 135756169 kv/txn_coord_sender.go:817  [n1] heartbeat to "unnamed" id=240ad921 key=/Table/11/1/53/1/2017-11-17T09:23:08.281535Z/1 rw=true pri=0.05354313 iso=SERIALIZABLE stat=PENDING epo=0 ts=1512059910.429426592,0 orig=1512059910.429426592,0 max=1512059910.929426592,0 wto=false rop=true seq=13 failed: heartbeat for transaction "unnamed" id=240ad921 key=/Table/11/1/53/1/2017-11-17T09:23:08.281535Z/1 rw=true pri=0.05354313 iso=SERIALIZABLE stat=PENDING epo=0 ts=1512059910.429426592,0 orig=1512059910.429426592,0 max=1512059910.929426592,0 wto=false rop=true seq=14 failed; record not present
2017-11-30 10:08:24: W171130 10:08:24.486360 135756141 kv/txn_coord_sender.go:817  [n1] heartbeat to "unnamed" id=8389e0ca key=/Table/11/1/53/1/2017-11-18T10:20:05.310992Z/1 rw=true pri=0.05132444 iso=SERIALIZABLE stat=PENDING epo=0 ts=1512059910.439901342,0 orig=1512059910.439901342,0 max=1512059910.939901342,0 wto=false rop=true seq=13 failed: heartbeat for transaction "unnamed" id=8389e0ca key=/Table/11/1/53/1/2017-11-18T10:20:05.310992Z/1 rw=true pri=0.05132444 iso=SERIALIZABLE stat=PENDING epo=0 ts=1512059910.439901342,0 orig=1512059910.439901342,0 max=1512059910.939901342,0 wto=false rop=true seq=14 failed; record not present
2017-11-30 10:08:24: W171130 10:08:24.788741 135805595 kv/txn_coord_sender.go:817  [n1] heartbeat to "unnamed" id=00c0a833 key=/Table/11/1/52/1/2017-11-19T13:18:58.198972Z/1 rw=true pri=0.00633216 iso=SERIALIZABLE stat=PENDING epo=0 ts=1512059910.377148643,0 orig=1512059910.377148643,0 max=1512059910.877148643,0 wto=false rop=true seq=13 failed: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = Canceled desc = context canceled
2017-11-30 10:08:24: W171130 10:08:24.789769 135864473 kv/txn_coord_sender.go:817  [n1] heartbeat to "unnamed" id=266e84dd key=/Table/11/1/52/1/2017-11-22T15:19:41.381745Z/1 rw=true pri=0.01687426 iso=SERIALIZABLE stat=PENDING epo=0 ts=1512059910.377827263,0 orig=1512059910.377827263,0 max=1512059910.877827263,0 wto=false rop=true seq=13 failed: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = Canceled desc = context canceled
[...]

There's also some of these messages mixed in:

2017-11-30 10:08:33: W171130 10:08:33.821759 135569015 sql/lease.go:262  error releasing lease "53(\"users\") ver=1:1511033711612463000": result is ambiguous (sending to all 3 replicas failed; last error: rpc error: code = Internal desc = transport is closing, but RPC failure may have masked txn commit)

@a-robinson
Contributor

Is it possible there was some sort of schema change ongoing around the time of the incident?

@petermattis
Collaborator

Is it possible there was some sort of schema change ongoing around the time of the incident?

That would explain the table lease activity. In steady state we don't do any KV operations for table leases.

@a-robinson
Contributor

Also, a lot of the keys being touched are quite old. Why would we have transactions anchored on keys like /Table/11/1/53/1/2017-11-17T09:23:08.281535Z/1 and /Table/11/1/53/1/2017-11-18T10:20:05.310992Z/1?

@petermattis
Collaborator

That's the lease table primary key. descID=53, version=1, expiration=2017-11-17T09:23:08.281535Z, nodeID=1. No idea why the expiration is so old.
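In case it helps anyone else reading these keys, here is a toy Go sketch of that mapping. It only parses the pretty-printed form shown in the logs; the real keys are binary and decoded by CockroachDB itself, and the function name here is made up purely for illustration.

package main

import (
	"fmt"
	"strings"
)

// decodeLeaseKey splits a pretty-printed system.lease key such as
// /Table/11/1/53/1/2017-11-17T09:23:08.281535Z/1 into labeled parts.
// Purely illustrative: this only mirrors the mapping described above.
func decodeLeaseKey(key string) (map[string]string, error) {
	parts := strings.Split(strings.TrimPrefix(key, "/"), "/")
	// Expected layout: Table/<tableID>/<indexID>/<descID>/<version>/<expiration>/<nodeID>
	if len(parts) != 7 || parts[0] != "Table" {
		return nil, fmt.Errorf("unexpected key format: %q", key)
	}
	return map[string]string{
		"tableID":    parts[1], // 11 = system.lease
		"indexID":    parts[2], // 1 = primary index
		"descID":     parts[3],
		"version":    parts[4],
		"expiration": parts[5],
		"nodeID":     parts[6],
	}, nil
}

func main() {
	fields, err := decodeLeaseKey("/Table/11/1/53/1/2017-11-17T09:23:08.281535Z/1")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields) // descID=53, version=1, expiration=2017-11-17..., nodeID=1
}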

@a-robinson
Contributor

Node 3 was the leaseholder for r6 when it went down -- it had been the leaseholder since Nov 14. Node 4 took over the lease at 8:33:30.465. The lease history is a little odd, though.

It has 3 consecutive entries in the range lease history, all with the same epoch and start time, with the last 2 even having the same proposed time. Maybe it's just a bug in the lease history tracking code, but that seems odd:

    {
      "start": {
        "wall_time": 1510698653265036216,
        "logical": 0
      },
      "expiration": {
        "wall_time": 0,
        "logical": 0
      },
      "replica": {
        "node_id": 3,
        "store_id": 3,
        "replica_id": 7
      },
      "deprecated_start_stasis": {
        "wall_time": 0,
        "logical": 0
      },
      "proposed_ts": {
        "wall_time": 1510698653265038441,
        "logical": 0
      },
      "epoch": 7
    },
    {
      "start": {
        "wall_time": 1512059610465161277,
        "logical": 0
      },
      "expiration": {
        "wall_time": 0,
        "logical": 0
      },
      "replica": {
        "node_id": 4,
        "store_id": 4,
        "replica_id": 3
      },
      "deprecated_start_stasis": {
        "wall_time": 0,
        "logical": 0
      },
      "proposed_ts": {
        "wall_time": 1512059610465165052,
        "logical": 0
      },
      "epoch": 11
    },
    {
      "start": {
        "wall_time": 1512059610465161277,
        "logical": 0
      },
      "expiration": {
        "wall_time": 0,
        "logical": 0
      },
      "replica": {
        "node_id": 4,
        "store_id": 4,
        "replica_id": 3
      },
      "deprecated_start_stasis": {
        "wall_time": 0,
        "logical": 0
      },
      "proposed_ts": {
        "wall_time": 1512059925364703223,
        "logical": 0
      },
      "epoch": 11
    },
    {
      "start": {
        "wall_time": 1512059610465161277,
        "logical": 0
      },
      "expiration": {
        "wall_time": 0,
        "logical": 0
      },
      "replica": {
        "node_id": 4,
        "store_id": 4,
        "replica_id": 3
      },
      "deprecated_start_stasis": {
        "wall_time": 0,
        "logical": 0
      },
      "proposed_ts": {
        "wall_time": 1512059925364703223,
        "logical": 0
      },
      "epoch": 11
    }
  ],
  "problems": {}
}

@a-robinson
Contributor

Ooh, the raft state is interesting. Not only did we have a leader-not-leaseholder situation, with node 3 being the raft leader while node 4 was the leaseholder, but node 4's replica (replica 3) was also in the state "paused": true:

  "raft_state": {
    "replica_id": 7,
    "hard_state": {
      "term": 385,
      "vote": 7,
      "commit": 612152
    },
    "lead": 7,
    "state": "StateLeader",
    "applied": 612152,
    "progress": {
      "3": {
        "next": 612020,
        "state": "ProgressStateProbe",
        "paused": true
      },
      "6": {
        "match": 612152,
        "next": 612153,
        "state": "ProgressStateReplicate"
      },
      "7": {
        "match": 612152,
        "next": 612153,
        "state": "ProgressStateProbe"
      }
    }
  },

That could certainly prevent progress. Time to figure out what "paused" means...

@petermattis
Collaborator

I believe paused occurs if the leader doesn't hear from a follower.

@a-robinson
Contributor

n4 was stuck in an older term, in which it thought that n1 was the leader:

"raft_state": {
    "replica_id": 3,
    "hard_state": {
      "term": 382,
      "vote": 6,
      "commit": 611152
    },
    "lead": 6,
    "state": "StateFollower",
    "applied": 611131,
    "progress": null
  },

It has the proper HardState, though; it just hasn't caught up applying commands. Did it perhaps get stuck? A deadlock, maybe? Checking now.

@petermattis
Collaborator

A follower is paused if it is in ProgressStateProbe and we append another Raft entry. This is done to avoid sending appends to a down follower. A follower is also paused after we send it a Raft snapshot and are waiting for it to respond with a MsgAppResp.
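For readers not familiar with the raft internals, here is a minimal Go sketch of the pausing behavior just described. The names and structure only mirror the etcd/raft concepts; this is a simplified illustration, not the library's actual code or API.

package main

import "fmt"

// Simplified sketch of leader-side progress tracking for one follower.
type progressState int

const (
	stateProbe     progressState = iota // leader is unsure how far the follower is
	stateReplicate                      // healthy follower, stream entries optimistically
	stateSnapshot                       // follower needs a snapshot to catch up
)

type progress struct {
	match, next uint64
	state       progressState
	paused      bool
}

// isPaused reports whether the leader should hold off sending more entries.
func (p *progress) isPaused() bool {
	switch p.state {
	case stateProbe:
		return p.paused
	case stateSnapshot:
		return true // waiting for the follower to apply the snapshot and respond
	default:
		return false
	}
}

// sendAppend models the leader appending entries to one follower. In probe
// state only a single append may be in flight, so the follower is marked
// paused until it responds.
func sendAppend(p *progress, lastIndex uint64) {
	if p.isPaused() {
		fmt.Println("follower paused; not sending")
		return
	}
	fmt.Printf("sending entries (%d, %d]\n", p.next-1, lastIndex)
	if p.state == stateProbe {
		p.paused = true
	}
}

func main() {
	// Roughly the situation reported for r6: replica 3 probing at index 612020.
	p := &progress{next: 612020, state: stateProbe}
	sendAppend(p, 612152) // the first probe is sent
	sendAppend(p, 612153) // further appends are suppressed until a response arrives
}

That appears to match the r6 state above: replica 3 sits in ProgressStateProbe with paused=true, so the leader holds off on further appends to it until it hears back.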

@a-robinson
Contributor

Could that have triggered it?

Yes! It's good to have confirmation of something that could have kicked off all the lease-related requests.

I'm still trying to reproduce the hour-long badness. So far, the worst I've experienced is two minutes of zero throughput.

It still may not be enough, but you should probably switch from local-SSD to HDD to increase the odds.

@petermattis
Collaborator

It still may not be enough, but you should probably switch from local-SSD to HDD to increase the odds.

That doesn't appear to have been necessary. I let 92k table leases accumulate and then caused them to be dropped by setting a cluster setting. Throughput almost immediately dropped to 0 and the number of goroutines in the system jumped to 240k (from a baseline of 1.2k). Throughput has remained at 0 for 15 minutes and the cluster shows no signs of recovering. One node was killed due to memory growth. I'm not noticing any slow heartbeat stuff in the logs, but they are filled with "have been waiting 1m0s" messages.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Dec 4, 2017
@gpaul
Author

gpaul commented Dec 4, 2017

If a cluster is running v1.0.6 and has built up a large number of stale leases, is it safe to perform a rolling upgrade to v1.1.3+ without triggering an outage? If so, what will happen to the ~100k stale leases? If I understand correctly, those correspond to physical records in a system database?

@petermattis
Collaborator

Yes, the table leases correspond to physical records in the system.lease table. We'll have to do some experimentation to determine if performing a rolling upgrade is safe. My two concerns are that the rolling upgrade itself might trigger purging of the old leases and that v1.1.3+ might not ever clear out the old leases.
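For anyone wanting to gauge their exposure before attempting this, a rough way to count the accumulated leases is to query system.lease directly. Below is a minimal Go sketch using database/sql with the lib/pq driver (CockroachDB speaks the Postgres wire protocol); the connection string is a placeholder, and treating rows with a past expiration as the "leaked" ones is my assumption based on this thread rather than something confirmed here.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres wire-protocol driver
)

func main() {
	// Example DSN only; substitute your own host, port, user, and certs.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/system?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Leases whose expiration has already passed should have been cleaned
	// up; a very large count here suggests exposure to the mass-deletion
	// problem discussed in this issue.
	var stale int
	err = db.QueryRow(
		`SELECT count(*) FROM system.lease WHERE expiration < now()`,
	).Scan(&stale)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("stale leases: %d\n", stale)
}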

@gpaul
Author

gpaul commented Dec 5, 2017

Would it make sense to just delete the old leases? The expiration column suggests as much:

[email protected]:26257/system> select * from system.lease limit 100;
+--------+---------+--------+----------------------------------+
| descID | version | nodeID |            expiration            |
+--------+---------+--------+----------------------------------+
|     51 |       1 |      1 | 2017-12-02 07:27:15.387225+00:00 |
|     51 |       1 |      2 | 2017-12-02 07:27:58.128568+00:00 |
|     51 |       1 |      5 | 2017-12-02 07:28:03.962719+00:00 |
|     51 |       1 |      3 | 2017-12-02 07:29:07.289794+00:00 |
|     51 |       1 |      4 | 2017-12-02 07:29:32.342058+00:00 |
|     51 |       1 |      2 | 2017-12-02 07:34:02.36372+00:00  |
|     51 |       1 |      5 | 2017-12-02 07:34:09.735477+00:00 |
|     51 |       1 |      1 | 2017-12-02 07:34:34.129586+00:00 |
|     51 |       1 |      4 | 2017-12-02 07:35:00.852637+00:00 |
|     51 |       1 |      3 | 2017-12-02 07:35:44.470331+00:00 |
|     51 |       1 |      2 | 2017-12-02 07:39:27.277347+00:00 |
|     51 |       1 |      1 | 2017-12-02 07:39:56.759884+00:00 |
|     51 |       1 |      4 | 2017-12-02 07:40:08.561545+00:00 |
|     51 |       1 |      3 | 2017-12-02 07:40:52.459743+00:00 |
|     51 |       1 |      5 | 2017-12-02 07:41:12.374174+00:00 |
...

@petermattis
Collaborator

Would it make sense to just delete old leases?

Is your suggestion to manually delete the old leases? That will likely run afoul of the internal invariant that only the node that created a lease should delete it. If your suggestion is that Cockroach should delete the old leases, well, that's exactly the bug. It should be deleting these old leases, except that a bug in 1.0.x prevented them from being deleted in a fairly common scenario. Setting a cluster setting triggers an additional code path which does cause the old leases to be deleted, but there is no throttling on that deletion and Cockroach spawns a goroutine per deletion. This creates a massive spike in traffic and overwhelms parts of the system, leading to the freeze-up.
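To make that failure mode concrete, here is a hedged Go sketch, entirely my own rather than CockroachDB's code, contrasting the goroutine-per-deletion pattern with a semaphore-bounded version that would throttle the same work.

package main

import (
	"fmt"
	"sync"
	"time"
)

// releaseLease stands in for the KV transaction that deletes one row from
// system.lease; the real work involves network round trips, so each call
// takes a while.
func releaseLease(id int) {
	_ = id
	time.Sleep(10 * time.Millisecond)
}

// releaseAllUnbounded mimics the problematic pattern: one goroutine per
// stale lease, all launched at once. With ~100k stale leases this means
// ~100k concurrent KV requests and a matching spike in goroutines.
func releaseAllUnbounded(ids []int) {
	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			releaseLease(id)
		}(id)
	}
	wg.Wait()
}

// releaseAllBounded shows one way the same work could be throttled: a
// buffered channel acts as a semaphore so at most `limit` releases are in
// flight at any time.
func releaseAllBounded(ids []int, limit int) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		sem <- struct{}{}
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }()
			releaseLease(id)
		}(id)
	}
	wg.Wait()
}

func main() {
	ids := make([]int, 1000)
	for i := range ids {
		ids[i] = i
	}
	start := time.Now()
	releaseAllBounded(ids, 10)
	fmt.Println("bounded cleanup of 1000 leases took", time.Since(start))
}

The unbounded variant is what lets the goroutine count jump by orders of magnitude when ~100k leases become deletable at once.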

@gpaul
Author

gpaul commented Dec 5, 2017

Is your suggestion to manually delete the old leases?

That was what I had in mind, yes. It makes sense that this could lead to issues, but I was hoping it wouldn't.

I have several clusters running for a long time that will likely run into this issue, too. I'd really like to prevent them from experiencing a major outage.

@petermattis
Collaborator

I have several clusters running for a long time that will likely run into this issue, too. I'd really like to prevent them from experiencing a major outage.

I think you should be able to perform a rolling restart of the Cockroach nodes, but let me confer with others before endorsing that action.

@gpaul
Author

gpaul commented Dec 5, 2017

OK - are you referring to a rolling upgrade or a rolling restart of the affected instances? Once you've determined that it will solve the issue, would you mind elaborating on why?

To summarize, I have several clusters that have no choice but to perform triggering actions (these are baked into releases that cannot be skipped). If there is any way to clean out the stale leases as a pre-upgrade step and thereby avoid triggering the outage, that would be really, really good.

@gpaul
Author

gpaul commented Dec 5, 2017

Considering that we're looking at multi-hour outages, even a set of steps that requires stopping the cockroachdb instances, or one that causes a much shorter outage, would help.

@petermattis
Collaborator

I'm referring to a rolling restart of the affected instances. When a rolling restart is performed, the "leaked" leases on the restarted node will no longer be candidates for deletion (they will not be picked up by the restarted node). This effectively causes the leases to remain in the system.lease table permanently (or until manually deleted), but that is less of a problem and I'm about to open an issue to address that. The upshot of the rolling restart is that we won't hit the condition where the system tries to delete tens or hundreds of thousands of leases simultaneously.

You'll want to make sure that the rolling restart is just the cockroach nodes. You would definitely not want to accidentally trigger deletion of the leases (e.g. by setting a cluster setting).

@gpaul
Author

gpaul commented Dec 5, 2017

This is great news, thank you. I'm happy to perform a rolling restart of the affected cockroachdb instances before upgrading.

@petermattis
Collaborator

We can later investigate deleting the "leaked" leases from the system.lease table. Do you have a test cluster you can test this on? If not, I can see about setting up a test to verify this will work smoothly.

@gpaul
Author

gpaul commented Dec 5, 2017

I have a test cluster I can test this on - thanks, doing so now.

@petermattis
Collaborator

#20485

@gpaul
Author

gpaul commented Dec 5, 2017

I can confirm that a rolling restart, followed by a trigger action, does not cause an outage.

@petermattis
Collaborator

Fantastic. Can you also confirm that there are still a lot of leases in system.lease via select count(*) from system.lease?

@gpaul
Author

gpaul commented Dec 5, 2017

select count(*) from system.lease;
+----------+
| count(*) |
+----------+
|     1152 |
+----------+
(1 row)

That's fewer than I would expect, and all of their expirations fall within the last day or two. An exact duplicate of this cluster had 22k before I triggered an outage. Hmm - I wish I'd checked this number before performing the rolling restart and subsequent trigger action.

I'll have to wait a few days for these clusters to build up some stale leases again before I'm able to confirm beyond a doubt that it works.

@gpaul
Author

gpaul commented Dec 5, 2017

If not, I can see about setting up a test to verify this will work smoothly.

I might have to take you up on that offer.

@petermattis
Collaborator

Yeah, that number of leases isn't enough to cause a problem. I'll see about doing a test locally. It probably won't happen until tomorrow or Thursday.

@gpaul
Author

gpaul commented Dec 5, 2017

I borrowed a long-running cluster from a colleague and checked the lease count. Before the rolling restart:

[email protected]:26257/system> select count(*) from system.lease;
+----------+
| count(*) |
+----------+
|   150043 |
+----------+
(1 row)

And after restarting and performing a trigger action:

[email protected]:26257/system> select count(*) from system.lease;
+----------+
| count(*) |
+----------+
|   143677 |
+----------+
(1 row)

It appears to work perfectly, thanks.

@gpaul
Author

gpaul commented Feb 2, 2018

Closing, thanks for the wonderful support everyone!

@gpaul gpaul closed this as completed Feb 2, 2018
@jordanlewis jordanlewis added C-question A question rather than an issue. No code/spec/doc change needed. O-community Originated from the community and removed O-deprecated-community-questions labels Apr 24, 2018