kv: large-disk cluster has poor performance post-restart #56876

Closed
jbowens opened this issue Nov 18, 2020 · 7 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-performance Perf of queries or internals. Solution not expected to change functional behavior. no-issue-activity T-kv KV Team

Comments

jbowens (Collaborator) commented Nov 18, 2020

Reproduction steps

Create a cluster with 10 TB disks, e.g.:

roachprod create jackson-bank-200b --username jackson --clouds aws -n 8 \
  --local-ssd=false --aws-machine-type c5.4xlarge --aws-ebs-volume-size=10000
roachprod put    jackson-bank-200b ./cockroach-linux-2.6.32-gnu-amd64 ./cockroach
roachprod start  jackson-bank-200b
roachprod sql    jackson-bank-200b:1 -- -e \
   "set cluster setting kv.range_merge.queue_enabled = false"
roachprod sql    jackson-bank-200b:1 -- -e \
   "set cluster setting kv.bulk_io_write.concurrent_addsstable_requests = 8"

Import a large bank dataset. (This takes ~28 hours.)

roachprod run    jackson-bank-200b:1 -- \
   "./cockroach workload fixtures import bank --rows 200000000000 {pgurl:1} >/dev/null 2>&1 &"

Restart the cluster and run the bank workload:

roachprod stop jackson-bank-200b
roachprod start jackson-bank-200b
roachprod run jackson-bank-200b:1 -- \
    ./cockroach workload run bank --ramp=5m --rows=200000000000 --duration=15m

Throughput is around 40 ops/sec. See https://docs.google.com/document/d/1rfWNGFZ6gulKqb6BMXeMzq9GkMgdVfXnC4SJ3FM3ilE/edit?usp=sharing
[screenshot: Screen Shot 2020-11-18 at 6 39 28 PM]

Jira issue: CRDB-2897

jbowens added the A-kv label Nov 18, 2020
blathers-crl bot commented Nov 18, 2020

Hi @jbowens, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

jbowens added the C-performance label Nov 18, 2020
jbowens (Collaborator, Author) commented Nov 19, 2020

I left this cluster running for a bit and eventually all but one node was OOM-killed. The logs showed persistent liveness errors and lots of duplicate-connection gossip errors:

I201119 00:09:36.443177 8449909 gossip/client.go:124 ⋮ [n6] started gossip client to ‹10.12.21.44:26257›
I201119 00:09:36.444866 8449909 gossip/client.go:129 ⋮ [n6] closing client to n3 (‹10.12.21.44:26257›): ‹rpc error: code = Unknown desc = duplicate connection from node at 10.12.28.144:26257›
I201119 00:09:37.763208 8451167 gossip/server.go:277 ⋮ [n6] refusing gossip from n3 (max 3 conns); forwarding to n1 (‹10.12.16.145:26257›)
I201119 00:09:38.474971 8451309 gossip/client.go:124 ⋮ [n6] started gossip client to ‹10.12.27.108:26257›
I201119 00:09:38.485140 8451309 gossip/client.go:129 ⋮ [n6] closing client to n4 (‹10.12.27.108:26257›): stopping outgoing client to n4 (‹10.12.27.108:26257›); already have incoming
I201119 00:09:38.972033 8452297 gossip/server.go:277 ⋮ [n6] refusing gossip from n2 (max 3 conns); forwarding to n4 (‹10.12.27.108:26257›)
I201119 00:09:39.478998 8451560 gossip/client.go:124 ⋮ [n6] started gossip client to ‹10.12.23.41:26257›
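
To gauge how widespread the gossip churn was, a rough per-node count of these messages can be pulled with something like the following (assuming roachprod's default log layout under logs/):

roachprod run jackson-bank-200b -- \
    "grep -c 'duplicate connection' logs/cockroach.log || true"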

lunevalex (Collaborator) commented:

[screenshot: Screen Shot 2020-11-18 at 6 38 26 PM]

lunevalex (Collaborator) commented:

This may be the same raft scheduler issue we saw in #56851, given the number of ranges.
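
For reference, the total range count can be pulled with something along these lines (assuming crdb_internal.ranges_no_leases is available in this version; it avoids leaseholder lookups on a cluster this size):

roachprod sql jackson-bank-200b:1 -- -e \
    "select count(*) from crdb_internal.ranges_no_leases"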

nvanbenschoten (Member) commented:

#56943 is introducing a new raft.scheduler.latency metric, which will allow us to verify whether Raft scheduler latency is the issue here if we decide to run this experiment again.
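
Once that metric is available, one quick way to eyeball it would be the node metrics virtual table (or the /_status/vars Prometheus endpoint); a sketch, assuming the histogram is exported under the raft.scheduler.latency prefix:

roachprod sql jackson-bank-200b:1 -- -e \
    "select name, value from crdb_internal.node_metrics where name like 'raft.scheduler.latency%'"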

jlinder added the T-kv label Jun 16, 2021
github-actions bot commented Sep 6, 2023

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

jbowens (Collaborator, Author) commented Sep 6, 2023

I'll close this out; we should revisit node density as part of the planned scalability limits work (cc @williamkulju). I don't think this one data point from three years ago provides much context, and that work will hopefully identify specific, concrete obstacles to high node density.

jbowens closed this as not planned on Sep 6, 2023
github-project-automation bot moved this to Closed in KV on Aug 28, 2024