kv: large-disk cluster has poor performance post-restart #56876

Closed
jbowens opened this issue Nov 18, 2020 · 7 comments
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-performance Perf of queries or internals. Solution not expected to change functional behavior. no-issue-activity T-kv KV Team

Comments

jbowens (Collaborator) commented Nov 18, 2020

Reproduction steps

Create a cluster with 10 TB disks, e.g.:

roachprod create jackson-bank-200b --username jackson --clouds aws -n 8 \
  --local-ssd=false --aws-machine-type c5.4xlarge --aws-ebs-volume-size=10000
roachprod put    jackson-bank-200b ./cockroach-linux-2.6.32-gnu-amd64 ./cockroach
roachprod start  jackson-bank-200b
roachprod sql    jackson-bank-200b:1 -- -e \
   "set cluster setting kv.range_merge.queue_enabled = false"
roachprod sql    jackson-bank-200b:1 -- -e \
   "set cluster setting kv.bulk_io_write.concurrent_addsstable_requests = 8"

Import a large bank dataset. (This takes ~28 hours.)

roachprod run    jackson-bank-200b:1 -- \
   "./cockroach workload fixtures import bank --rows 200000000000 {pgurl:1} >/dev/null 2>&1 &"

Restart the cluster and run the bank workload:

roachprod stop jackson-bank-200b
roachprod start jackson-bank-200b
roachprod run jackson-bank-200b:1 -- \
    ./cockroach workload run bank --ramp=5m --rows=200000000000 --duration=15m

Throughput is around 40 ops/sec. See https://docs.google.com/document/d/1rfWNGFZ6gulKqb6BMXeMzq9GkMgdVfXnC4SJ3FM3ilE/edit?usp=sharing
[screenshot: Screen Shot 2020-11-18 at 6 39 28 PM]

Jira issue: CRDB-2897

jbowens added the A-kv label Nov 18, 2020
blathers-crl bot commented Nov 18, 2020

Hi @jbowens, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

jbowens added the C-performance label Nov 18, 2020
jbowens (Collaborator, Author) commented Nov 19, 2020

I left this cluster running for a bit and eventually all but one node was OOM-killed. The logs showed persistent liveness errors and lots of duplicate-connection gossip errors:

I201119 00:09:36.443177 8449909 gossip/client.go:124 ⋮ [n6] started gossip client to ‹10.12.21.44:26257›
I201119 00:09:36.444866 8449909 gossip/client.go:129 ⋮ [n6] closing client to n3 (‹10.12.21.44:26257›): ‹rpc error: code = Unknown desc = duplicate connection from node at 10.12.28.144:26257›
I201119 00:09:37.763208 8451167 gossip/server.go:277 ⋮ [n6] refusing gossip from n3 (max 3 conns); forwarding to n1 (‹10.12.16.145:26257›)
I201119 00:09:38.474971 8451309 gossip/client.go:124 ⋮ [n6] started gossip client to ‹10.12.27.108:26257›
I201119 00:09:38.485140 8451309 gossip/client.go:129 ⋮ [n6] closing client to n4 (‹10.12.27.108:26257›): stopping outgoing client to n4 (‹10.12.27.108:26257›); already have incoming
I201119 00:09:38.972033 8452297 gossip/server.go:277 ⋮ [n6] refusing gossip from n2 (max 3 conns); forwarding to n4 (‹10.12.27.108:26257›)
I201119 00:09:39.478998 8451560 gossip/client.go:124 ⋮ [n6] started gossip client to ‹10.12.23.41:26257›
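
To gauge how widespread the gossip churn was, a rough per-node count of these messages can be pulled with something like the following (assuming roachprod's default log layout under logs/):

roachprod run jackson-bank-200b -- \
    "grep -c 'duplicate connection' logs/cockroach.log || true"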

lunevalex (Collaborator) commented:

[screenshot: Screen Shot 2020-11-18 at 6 38 26 PM]

lunevalex (Collaborator) commented:

This may be the same raft scheduler issue we saw in #56851, given the number of ranges.
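
For reference, the total range count can be pulled with something along these lines (assuming crdb_internal.ranges_no_leases is available in this version; it avoids leaseholder lookups on a cluster this size):

roachprod sql jackson-bank-200b:1 -- -e \
    "select count(*) from crdb_internal.ranges_no_leases"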

nvanbenschoten (Member) commented:

#56943 is introducing a new raft.scheduler.latency metric, which will allow us to verify whether Raft scheduler latency is the issue here if we decide to run this experiment again.
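
Once that metric is available, one quick way to eyeball it would be the node metrics virtual table (or the /_status/vars Prometheus endpoint); a sketch, assuming the histogram is exported under the raft.scheduler.latency prefix:

roachprod sql jackson-bank-200b:1 -- -e \
    "select name, value from crdb_internal.node_metrics where name like 'raft.scheduler.latency%'"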

jlinder added the T-kv label Jun 16, 2021
github-actions bot commented Sep 6, 2023

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

jbowens (Collaborator, Author) commented Sep 6, 2023

I'll close this out; we should revisit node density as part of the planned scalability limits work (cc @williamkulju). I don't think this one data point from three years ago provides much context, and that work will hopefully identify specific, concrete obstacles to high node density.

jbowens closed this as not planned on Sep 6, 2023
github-project-automation bot moved this to Closed in KV on Aug 28, 2024