Nodes consistently OOM during index backfill on large cluster #97801

Closed
rhu713 opened this issue Feb 28, 2023 · 7 comments
Labels
A-admission-control · A-kv Anything in KV that doesn't belong in a more specific category. · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · T-admission-control Admission Control

Comments

@rhu713
Contributor

rhu713 commented Feb 28, 2023

Describe the problem
While initializing the TPCE workload on a 96-node cluster with 10 million customers, nodes would consistently OOM while backfilling an index on one of the largest tables, tpce.public.trade.

This is the heap profile of an example node that was using a lot of memory while creating the index:
[Screenshot: heap profile of a node during the index backfill, 2023-02-28 1:24 PM]

To Reproduce
I created a 96-node cluster (+ 1 workload node):

roachprod create rui-backup-test-large -n 97 --gce-machine-type n2-standard-16 --local-ssd=false --gce-pd-volume-size=4096 --gce-zones us-central1-a

On node 97:
roachprod start rui-backup-test-large --racks=96 --env COCKROACH_ROCKSDB_CONCURRENCY=16

Started the TPCE workload for 10M customers with --init

sudo docker run cockroachdb/tpc-e:latest --init --customers=10000000 --racks=96 $(cat hosts.txt) -d 10d

Eventually the table imports succeed and index creation begins. When the backfill of the index on tpce.public.trade starts, nodes consistently die due to OOMs, and I have to restart the cockroach process for the index job to proceed.
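(Not part of the original repro, but useful while waiting on the backfill: schema-change progress is visible via SHOW JOBS. Below is a minimal polling sketch, assuming a hypothetical insecure SQL endpoint on localhost:26257 and the lib/pq driver; adjust the connection string for the real deployment.)

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Hypothetical connection string; adjust host, port, and certs for the cluster above.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpce?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for {
		var status string
		var fraction float64
		// SHOW JOBS exposes schema-change progress; pick the most recent running one.
		err := db.QueryRow(`SELECT status, fraction_completed
		                      FROM [SHOW JOBS]
		                     WHERE job_type = 'SCHEMA CHANGE' AND status = 'running'
		                  ORDER BY created DESC LIMIT 1`).Scan(&status, &fraction)
		if err != nil {
			log.Printf("no running schema change (or query failed): %v", err)
		} else {
			fmt.Printf("schema change %s: %.1f%% complete\n", status, 100*fraction)
		}
		time.Sleep(30 * time.Second)
	}
}
```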

Expected behavior
The index backfill for tpce.public.trade completes without nodes running out of memory.

Additional data / screenshots
debug zip: gs://rui-backup-test/debug/large-cluster/debug.zip
tsdump: gs://rui-backup-test/debug/large-cluster/tsdump.gob

Environment:

  • CockroachDB version: 22.2

Jira issue: CRDB-24895

@rhu713 rhu713 added the C-bug and A-kv labels Feb 28, 2023
@sumeerbhola
Collaborator

Copy-paste from Slack discussion:
[sumeer] Regarding AddSSTable queueing pre-evaluation and causing OOMs, the queueing is always a risk. One problem we postponed solving is early rejecting requests if memory consumption of queued requests was too high (and letting the client retry after backoff). The assumption behind postponing this was that internal work that generates these memory hungry AddSSTables had limited concurrency and therefore implicit flow control. Is that not the case?
[ajwerner] That is the case, but on a large enough cluster, the concurrency is high.
[sumeer] Because of the fanin?
[ajwerner] Yes, every single node ends up blocked on the slow node eventually.
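
(For illustration only; this is not CockroachDB's actual admission code. The early-rejection idea above, where a request is bounced before queueing once the bytes held by waiting requests cross a budget, can be sketched as a memory-aware admission gate along the following lines. The names and numbers are made up.)

```go
package main

import (
	"errors"
	"sync"
)

// errBackoff is returned instead of queueing a request whose payload would
// push the waiting set past the memory budget; the client is expected to
// back off and retry, rather than the server holding the bytes and OOMing.
var errBackoff = errors.New("admission queue memory budget exceeded; retry after backoff")

// memQueue is a toy admission gate: a fixed number of evaluation slots plus
// an accounting of bytes held by requests that are queued or evaluating.
type memQueue struct {
	mu        sync.Mutex
	heldBytes int64
	maxBytes  int64
	slots     chan struct{}
}

func newMemQueue(concurrency int, maxBytes int64) *memQueue {
	return &memQueue{maxBytes: maxBytes, slots: make(chan struct{}, concurrency)}
}

// Admit blocks until an evaluation slot is free, unless admitting reqBytes
// would exceed the budget, in which case it rejects up front.
func (q *memQueue) Admit(reqBytes int64) (release func(), err error) {
	q.mu.Lock()
	if q.heldBytes+reqBytes > q.maxBytes {
		q.mu.Unlock()
		return nil, errBackoff // early rejection instead of unbounded queueing
	}
	q.heldBytes += reqBytes
	q.mu.Unlock()

	q.slots <- struct{}{} // wait for one of the evaluation slots
	return func() {
		<-q.slots
		q.mu.Lock()
		q.heldBytes -= reqBytes
		q.mu.Unlock()
	}, nil
}

func main() {
	q := newMemQueue(1 /* one evaluation at a time */, 64<<20 /* 64 MiB budget */)
	release, err := q.Admit(16 << 20) // e.g. a 16 MiB AddSSTable payload
	if err != nil {
		panic(err) // a real client would back off and retry here
	}
	defer release()
	// ... evaluate the request ...
}
```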

@msbutler
Collaborator

msbutler commented Feb 28, 2023

Two things to note:

  1. Rui did not see these issues in 23.1.
  2. I saw a similar sawtooth pattern of AddSSTable requests on the same index backfill on a smaller 22.2 TPC-E workload, which I documented in schema: potential 2X index backfill perf regression during tpce init #95163

@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 28, 2023
@ajwerner
Contributor

There was much internal discussion that prompted this issue; see here.

@irfansharif
Contributor

Copying some internal notes.

I’m reading this again. @dt and @rhu713, are we planning to run these 96-node TPC-E runs again? I’m four months too late to this thread and (a) the Grafana metrics are gone, and (b) lots of things have changed in 23.2: (i) replication admission control, (ii) removal of addsst concurrency limits in #104861.

> Regarding AddSSTable queueing pre-evaluation and causing OOMs, the queueing is always a risk.
Looking at the heap profile posted in #97801, with the concurrency limiter gone, we’ve somewhat reduced the likelihood of OOMs, right? There’s still the fan-in problem, but I wonder if re-running the same experiment will now hit the per-replica proposal quota pool limits of 8MiB, and reduce (but not eliminate) OOM likelihood. We’re no longer queueing on a limiter that permits 1 AddSST at a time (with no view over the memory held by other waiting requests); we’re ingesting them as quickly as AC will let us.

The client<->server protocol changes described in the messages above would still need to happen to completely eliminate server-side OOMs with a large degree of fan-in, right? I can’t really tell how real a problem it is anymore (re-running this same experiment would be the next clarifying step). If clients only issue more AddSSTs after previous ones have been processed by the server we’re all fanning into, then after the initial burst of AddSSTs (and depletion of client-side memory budgets), subsequent AddSSTs will only be issued, in aggregate, at the rate the single server is ingesting them, right? So we have flow control? The OOM concerns are then only around the initial burst (which I think should be smaller as far as the server is concerned, with the concurrency limiter gone)?
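
(To make the flow-control argument concrete: if each client caps its unacknowledged AddSSTs, then many clients fanning into one slow node collectively issue work only as fast as that node acknowledges it, after an initial burst of clients × cap requests. A minimal, hypothetical sketch of such a client-side cap follows; this is not the actual DistSender or backfiller code, and all names are made up.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sendWithCap bounds the number of unacknowledged requests a single client
// keeps in flight. With many such clients fanning into one slow server, the
// aggregate issue rate settles at the server's ingest rate once the initial
// burst of maxInFlight requests per client has been absorbed.
func sendWithCap(requests []int, maxInFlight int, send func(int)) {
	inFlight := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for _, req := range requests {
		inFlight <- struct{}{} // blocks once maxInFlight requests are unacked
		wg.Add(1)
		go func(r int) {
			defer wg.Done()
			defer func() { <-inFlight }() // "ack": free a slot only after the server responds
			send(r)
		}(req)
	}
	wg.Wait()
}

func main() {
	slowServer := func(r int) {
		time.Sleep(100 * time.Millisecond) // the one slow node everyone fans into
		fmt.Println("ingested request", r)
	}
	// Two requests in flight at a time; the rest wait for acknowledgements.
	sendWithCap([]int{1, 2, 3, 4, 5}, 2, slowServer)
}
```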

@rhu713
Contributor Author

rhu713 commented Sep 22, 2023

I don't think the DR team currently has any plans to rebuild the 96-node cluster.

@exalate-issue-sync exalate-issue-sync bot added T-admission-control Admission Control and removed T-kv KV Team labels Oct 19, 2023
@shralex
Contributor

shralex commented Oct 19, 2023

@aadityasondhi could you please see if this is still reproducible as part of the large-scale testing of replication AC? If it is, we can decide where this belongs. Thank you!

@sumeerbhola
Collaborator

Regarding #97801 (comment), we didn't have any plans for "large-scale testing of replication AC". Any problems with replication AC should be as reproducible in a small cluster as in a large one, and we have done the small-cluster experiments via roachtests.
If someone experiments with a large cluster and there are issues, we can investigate. Meanwhile, I am closing this. Feel free to reopen if I have misunderstood something.

@sumeerbhola sumeerbhola closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 20, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024