distribution: Decommission causing latency spike of 10s #35890
Comments
Did the latencies return to baseline when decommissioning was done and the replicas-per-node graph had dropped to zero for the decommissioned node? I assume the node was still running, or was it dead?
This is running live right now on
The theory behind trying this is that as replicas are moved off the decommissioning node, "orphaned" replicas waiting for replicaGC are left behind. Requests ending up at them will be left hanging until the replicaGC happens, which is typically quick but perhaps isn't actually as quick for some reason. Seems to repro readily, so all we need to fix this is some elbow grease. As far as roachtesting this goes, we should have a tpcc run like the above and assert that p99 remains below some threshold.
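As a rough sketch of that assertion (not the eventual roachtest, just the idea): tee the workload's periodic stats to a log and fail if any interval's p99 crosses a threshold. The log path (tpcc.log), the 10s threshold, and the column position of p99(ms) in the stats lines are all assumptions here; verify them against your workload version before relying on this.

roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=4000 --duration=10h --tolerate-errors --scatter {pgurl:1-6}" | tee tpcc.log
# p99(ms) is column 7 in the per-interval stats lines the workload prints;
# header and text columns evaluate to 0 under $7+0 and are skipped.
awk '$7+0 > 10000 { print "p99 above 10s at elapsed", $1; bad=1 } END { exit bad }' tpcc.log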
This might be a separate problem, but I tried to recommission the node after our experiment. It resulted in two suspect nodes and a number of under-replicated ranges, caused by a panic in MarshalTo. I think it's probably related to #35803.
@tbg I assume this won't be fixed for 19.1 since no one has looked at it yet.
With timeframes being what they are, it seems reasonable to assume that. I'd like to have someone look at this soon, though. Perhaps @andreimatei once the consistency stuff is mitigated.
Describe the problem
I set up a 6-node cluster running TPC-C with 5k warehouses (under 4k active load) and see large latency spikes. Latency was ~500ms before I began decommissioning.
Note that the gap in the middle was because I didn't run with --tolerate-errors (which I'm now running with). Also note that the rebalancing was more or less horizontal for 40 minutes.
To Reproduce
export CLUSTER=andy-decommission
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=5000 --db=tpcc"
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=4000 --duration=10h --split --scatter {pgurl:1-6}"
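The decommission step itself isn't shown above; presumably it was kicked off with something along these lines once the workload had ramped (the target node ID 6 and the --insecure flag are assumptions about this cluster's setup):

roachprod run $CLUSTER:1 -- "./cockroach node decommission 6 --insecure"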
After that error, I'm now running:
roachprod run $CLUSTER:7 "./workload run tpcc --ramp=10m --warehouses=5000 --active-warehouses=4000 --duration=10h --tolerate-errors --scatter {pgurl:1-6}"
Expected behavior
Minimal impact on latency and a roughly constant rate of replica rebalancing.
Environment:
v19.1.0-beta.20190304-507-gc2939ec
Jira issue: CRDB-4546