distribution: Decommission causing latency spike of 10s #35890
Comments
Did the latencies return to baseline when decommissioning was done and the replicas-per-node graph had dropped to zero for the decommissioned node? I assume the node was still running, or was it dead?
This is running live right now on
The theory behind trying this is that as replicas are moved off the decommissioning node, "orphaned" replicas waiting for replicaGC are left behind. Requests ending up at them will be left hanging until the replicaGC happens, which is typically quick but perhaps isn't actually as quick for some reason. Seems to repro readily, so all we need to fix this is some elbow grease. As far as roachtesting this goes, we should have a tpcc run like the above and assert that p99 remains below some threshold.
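As a rough sketch of that assertion (not the eventual roachtest, just the idea): tee the workload's periodic stats to a log and fail if any interval's p99 crosses a threshold. The log path (tpcc.log), the 10s threshold, and the column position of p99(ms) in the stats lines are all assumptions here; verify them against your workload version before relying on this.

roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=4000 --duration=10h --tolerate-errors --scatter {pgurl:1-6}" | tee tpcc.log
# p99(ms) is column 7 in the per-interval stats lines the workload prints;
# header and text columns evaluate to 0 under $7+0 and are skipped.
awk '$7+0 > 10000 { print "p99 above 10s at elapsed", $1; bad=1 } END { exit bad }' tpcc.log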
This might be a separate problem, but I tried to recommission the node after our experiment. It resulted in two suspect nodes and a number of under-replicated ranges, caused by a panic in MarshalTo. I think it's probably related to #35803.
@tbg I assume this won't be fixed for 19.1 since no one has looked at it yet.
With timeframes being what they are, it seems reasonable to assume that. I'd like to have someone look at this soon, though. Perhaps @andreimatei once the consistency stuff is mitigated.
Describe the problem
I set up a 6-node cluster running TPC-C with 5k warehouses (under 4k active load) and see large latency spikes. Latency was ~500ms before I began decommissioning.
Note that the gap in the middle was because I didn't run with --tolerate-errors (which I'm now running with). Also note that the rebalancing was more or less horizontal for 40 minutes.
To Reproduce
export CLUSTER=andy-decommission
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=4000 --duration=10h --split --scatter {pgurl:1-6}"
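The decommission step itself isn't shown above; presumably it was kicked off with something along these lines once the workload had ramped (the target node ID 6 and the --insecure flag are assumptions about this cluster's setup):

roachprod run $CLUSTER:1 -- "./cockroach node decommission 6 --insecure"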
After that error, I'm now running:
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=4000 --duration=10h --tolerate-errors --scatter {pgurl:1-6}"
Expected behavior
Minimal impact on latency and a roughly constant rate of replica rebalancing.
Environment:
v19.1.0-beta.20190304-507-gc2939ec
Jira issue: CRDB-4546