
distribution: Decommission causing latency spike of 10s #35890

Closed
awoods187 opened this issue Mar 18, 2019 · 9 comments
Labels: A-kv-decom-rolling-restart (Decommission and Rolling Restarts), A-kv-distribution (Relating to rebalancing and leasing), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), no-issue-activity, S-3-ux-surprise (Issue leaves users wondering whether CRDB is behaving properly; likely to hurt reputation/adoption), T-kv (KV Team), X-stale

Comments

awoods187 commented Mar 18, 2019

Describe the problem
I set up a 6 node cluster running TPC-C with 5k warehouses (under 4k active load) and see large latency spikes. Latency was ~500ms before I began decommissioning.

[screenshot]

Note that the gap in the middle was because I didn't run with --tolerate-errors (which I'm now running with).

Also, note that the rebalancing graph was more or less flat for 40 minutes.

To Reproduce

export CLUSTER=andy-decommission
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=4000 --duration=10h --split --scatter {pgurl:1-6}"
After hitting that error, I'm now running:
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=4000 --duration=10h --tolerate-errors --scatter {pgurl:1-6}"
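
The decommission step itself, roughly (a sketch only: node 2 as the target and issuing it from node 1 are assumptions, consistent with the quit/recommission steps below and the same insecure setup):

# Assumption: decommission node 2, issued via node 1; the node id is illustrative.
roachprod run $CLUSTER:1 -- "./cockroach node decommission 2 --insecure"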

Expected behavior
Minimal impact on latency and a steady rate of replica movement off the decommissioning node.

Environment:
v19.1.0-beta.20190304-507-gc2939ec

Jira issue: CRDB-4546

@awoods187 awoods187 added the A-kv-distribution Relating to rebalancing and leasing. label Mar 18, 2019
@awoods187 awoods187 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Mar 18, 2019

tbg commented Mar 18, 2019

Did the latencies return to baseline when decommissioning was done and the replicas-per-node graph had dropped to zero for the decommissioned node?

I assume the node was still running, or was it dead?

@awoods187

This is running live right now on andy-decommission.

@awoods187

We tried:

ubuntu@ip-172-31-40-194:~$ ./cockroach quit --insecure
ok

And saw latency drop all the way back to its pre-decommission level:
[screenshot]


tbg commented Mar 18, 2019

The theory behind trying this is that as replicas are moved off the decommissioning node, "orphaned" replicas waiting for replicaGC are left behind. Requests that end up at them are left hanging until the replicaGC happens, which is typically quick but perhaps isn't in this case for some reason.

Seems to repro readily, so all we need to fix this is some elbow grease. As far as roachtesting this goes, we should have a tpcc run like the above and assert that p99 remains below some threshold.
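
As a rough sketch of that assertion (not an actual roachtest), something like the following could run the workload during the decommission and fail if any reported p99 goes over a threshold. It assumes the workload's periodic histogram lines put p99(ms) in the seventh whitespace-separated field; the 2000ms threshold and 1h duration are arbitrary choices:

roachprod run $CLUSTER:7 "./workload run tpcc --warehouses=5000 --active-warehouses=4000 --duration=1h --tolerate-errors --scatter {pgurl:1-6}" | tee workload.log
# Assumption: field 7 of each per-interval line is p99(ms); header lines evaluate to 0 and are ignored.
awk '$7+0 > 2000 { bad=1 } END { exit bad }' workload.log || echo "FAIL: p99 exceeded 2000ms during decommission"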

@awoods187

This might be a separate problem, but I tried to recommission the node after our experiment:

ubuntu@ip-172-31-40-194:~$ ./cockroach node recommission 2 --insecure
  id | is_live | replicas | is_decommissioning | is_draining
+----+---------+----------+--------------------+-------------+
   2 |  true   |    10110 |       false        |    false
(1 row)

It resulted in two suspect nodes and a number of under-replicated ranges:
[screenshot]
This included a hit to QPS and a larger latency spike:
[screenshot]
And eventually a dead node:
[screenshot]

This was caused by a panic in MarshalTo. I think it's probably related to #35803:

* ERROR: [n2,client=172.31.43.41:49488,user=root,txn=9f44c719] a panic has occurred!
*
runtime error: index out of range

goroutine 208498867 [running]:
runtime/debug.Stack(0x39e6dc0, 0xc03f4e5a10, 0xc000000003)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/util/log.ReportPanic(0x39e6dc0, 0xc03f4e5a10, 0xc000107300, 0x2e8cc80, 0x56ad210, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/crash_reporting.go:226 +0xa6
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc00050a090, 0x39e6dc0, 0xc03f4e5a10)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:183 +0xe4
panic(0x2e8cc80, 0x56ad210)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/cockroachdb/cockroach/pkg/roachpb.(*ScanRequest).MarshalTo(0xc096af9a00, 0xc02aef57e5, 0x13, 0x13, 0x15, 0x2, 0x11)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:10185 +0x17d
github.com/cockroachdb/cockroach/pkg/roachpb.(*RequestUnion_Scan).MarshalTo(0xc0207c6f70, 0xc02aef57e3, 0x15, 0x15, 0xf52cee, 0xc0207c6f70, 0x17)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:13287 +0xdf
github.com/cockroachdb/cockroach/pkg/roachpb.(*RequestUnion).MarshalTo(0xc0784b8838, 0xc02aef57e3, 0x15, 0x15, 0x17, 0xa3, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:13188 +0x73
github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).MarshalTo(0xc04a744400, 0xc02aef5740, 0xb8, 0xb8, 0xb8, 0xb8, 0x3244600)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:14600 +0x246
github.com/cockroachdb/cockroach/pkg/roachpb.(*BatchRequest).Marshal(0xc04a744400, 0x3244600, 0xc04a744400, 0x7fccd7afadf0, 0xc04a744400, 0xc00e315c01)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:14575 +0x7f
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/encoding/proto.codec.Marshal(0x3244600, 0xc04a744400, 0xc016468520, 0x3, 0xc000062f70, 0xc000062f00, 0x7fccd7b36440)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/encoding/proto/proto.go:70 +0x19c
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.encode(0x7fccd6a051a0, 0x5b41bb8, 0x3244600, 0xc04a744400, 0xc052e4e780, 0x3a12ba0, 0x39e7b00, 0x6, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/rpc_util.go:487 +0x5e
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc06f6a3110, 0x3244600, 0xc04a744400, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:482 +0xca
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc04a744480, 0x3244600, 0xc04a744400, 0xc0010be000, 0x32bd2e8)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:403 +0x43
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.invoke(0x39e6dc0, 0xc03f4e5a10, 0x32bd2e8, 0x21, 0x3244600, 0xc04a744400, 0x3196040, 0xc01b1c1c70, 0xc0010be000, 0xc0010bc840, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/call.go:75 +0xfe
github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.OpenTracingClientInterceptor.func1(0x39e6dc0, 0xc03f4e5a10, 0x32bd2e8, 0x21, 0x3244600, 0xc04a744400, 0x3196040, 0xc01b1c1c70, 0xc0010be000, 0x33e89c8, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/client.go:47 +0xb49
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc0010be000, 0x39e6dc0, 0xc03f4e5a10, 0x32bd2e8, 0x21, 0x3244600, 0xc04a744400, 0x3196040, 0xc01b1c1c70, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/call.go:35 +0x109
github.com/cockroachdb/cockroach/pkg/roachpb.(*internalClient).Batch(0xc01aeb9cc0, 0x39e6dc0, 0xc03f4e5a10, 0xc04a744400, 0x0, 0x0, 0x0, 0xc03f4e5a10, 0x1, 0x39e6dc0)
	/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9324 +0xd2
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).sendBatch(0xc052e4e720, 0x39e6dc0, 0xc03f4e5a10, 0x1, 0x39ba440, 0xc01aeb9cc0, 0x0, 0x0, 0x100000001, 0x2, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:199 +0x126
github.com/cockroachdb/cockroach/pkg/kv.(*grpcTransport).SendNext(0xc052e4e720, 0x39e6dc0, 0xc03f4e5a10, 0x0, 0x0, 0x100000001, 0x2, 0x4164, 0x0, 0xc0a37e7900, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/transport.go:168 +0x130
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendToReplicas(0xc0005fb8c0, 0x39e6dc0, 0xc03f4e5a10, 0xc0005fb910, 0x4164, 0xc0a6148d70, 0x3, 0x3, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1365 +0x2d3
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendRPC(0xc0005fb8c0, 0x39e6dc0, 0xc03f4e5a10, 0x4164, 0xc0a6148d70, 0x3, 0x3, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:416 +0x244
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendSingleRange(0xc0005fb8c0, 0x39e6dc0, 0xc03f4e5a10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0a37e7900, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:496 +0x221
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatch(0xc0005fb8c0, 0x39e6dc0, 0xc03f4e5a10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0a37e7900, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1141 +0x322
github.com/cockroachdb/cockroach/pkg/kv.(*DistSender).sendPartialBatchAsync.func1(0x39e6dc0, 0xc03f4e5a10)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/dist_sender.go:1063 +0x175
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask.func1(0xc00050a090, 0x39e6dc0, 0xc03f4e5a10, 0x32c9013, 0x24, 0xc000bef020, 0x3a1c820, 0xc000333540, 0xc096c9dc00)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:385 +0x110
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunLimitedAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:379 +0x23a

cockroach.log

@awoods187

@tbg I assume this won't be fixed for 19.1 since no one has looked at it yet.


tbg commented Mar 26, 2019

With timeframes being what they are, it seems reasonable to assume that. I'd like to have someone look at this soon, though. Perhaps @andreimatei once the consistency stuff is mitigated.

@awoods187

Ran this again just to see if it had changed:
[screenshot]


We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 4, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024