Nodes consistently OOM during index backfill on large cluster #97801

Closed
rhu713 opened this issue Feb 28, 2023 · 7 comments
Labels
A-admission-control · A-kv Anything in KV that doesn't belong in a more specific category. · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · T-admission-control Admission Control

Comments

@rhu713
Contributor

rhu713 commented Feb 28, 2023

Describe the problem
While initializing the TPCE workload on a 96-node cluster with 10 million customers, nodes would consistently OOM while backfilling an index on one of the largest tables, tpce.public.trade.

This is the heap profile of an example node that was using a lot of memory while creating the index:
[Screenshot: heap profile of a node during the index backfill, 2023-02-28 1:24 PM]

To Reproduce
I created a 96-node cluster (+ 1 workload node):

roachprod create rui-backup-test-large -n 97 --gce-machine-type n2-standard-16 --local-ssd=false --gce-pd-volume-size=4096 --gce-zones us-central1-a

On node 97:
roachprod start rui-backup-test-large --racks=96 --env COCKROACH_ROCKSDB_CONCURRENCY=16

Started the TPCE workload for 10M customers with --init

sudo docker run cockroachdb/tpc-e:latest --init --customers=10000000 --racks=96 $(cat hosts.txt) -d 10d

Eventually the table imports succeed and index creation begins. When the backfill of the index on tpce.public.trade starts, nodes consistently die due to OOMs, and I have to restart the cockroach process for the index job to proceed.
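(Not part of the original repro, but useful while waiting on the backfill: schema-change progress is visible via SHOW JOBS. Below is a minimal polling sketch, assuming a hypothetical insecure SQL endpoint on localhost:26257 and the lib/pq driver; adjust the connection string for the real deployment.)

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Hypothetical connection string; adjust host, port, and certs for the cluster above.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpce?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for {
		var status string
		var fraction float64
		// SHOW JOBS exposes schema-change progress; pick the most recent running one.
		err := db.QueryRow(`SELECT status, fraction_completed
		                      FROM [SHOW JOBS]
		                     WHERE job_type = 'SCHEMA CHANGE' AND status = 'running'
		                  ORDER BY created DESC LIMIT 1`).Scan(&status, &fraction)
		if err != nil {
			log.Printf("no running schema change (or query failed): %v", err)
		} else {
			fmt.Printf("schema change %s: %.1f%% complete\n", status, 100*fraction)
		}
		time.Sleep(30 * time.Second)
	}
}
```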

Expected behavior
The index backfill for tpce.public.trade completes without nodes running out of memory.

Additional data / screenshots
debug zip: gs://rui-backup-test/debug/large-cluster/debug.zip
tsdump: gs://rui-backup-test/debug/large-cluster/tsdump.gob

Environment:

  • CockroachDB version: 22.2

Jira issue: CRDB-24895

@rhu713 rhu713 added the C-bug and A-kv labels Feb 28, 2023
@sumeerbhola
Collaborator

Copy-paste from Slack discussion:
[sumeer] Regarding AddSSTable queueing pre-evaluation and causing OOMs, the queueing is always a risk. One problem we postponed solving is early rejecting requests if memory consumption of queued requests was too high (and letting the client retry after backoff). The assumption behind postponing this was that internal work that generates these memory hungry AddSSTables had limited concurrency and therefore implicit flow control. Is that not the case?
[ajwerner] That is the case, but on a large enough cluster, the concurrency is high.
[sumeer] Because of the fanin?
[ajwerner] Yes, every single node ends up blocked on the slow node eventually.
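
(For illustration only; this is not CockroachDB's actual admission code. The early-rejection idea above, where a request is bounced before queueing once the bytes held by waiting requests cross a budget, can be sketched as a memory-aware admission gate along the following lines. The names and numbers are made up.)

```go
package main

import (
	"errors"
	"sync"
)

// errBackoff is returned instead of queueing a request whose payload would
// push the waiting set past the memory budget; the client is expected to
// back off and retry, rather than the server holding the bytes and OOMing.
var errBackoff = errors.New("admission queue memory budget exceeded; retry after backoff")

// memQueue is a toy admission gate: a fixed number of evaluation slots plus
// an accounting of bytes held by requests that are queued or evaluating.
type memQueue struct {
	mu        sync.Mutex
	heldBytes int64
	maxBytes  int64
	slots     chan struct{}
}

func newMemQueue(concurrency int, maxBytes int64) *memQueue {
	return &memQueue{maxBytes: maxBytes, slots: make(chan struct{}, concurrency)}
}

// Admit blocks until an evaluation slot is free, unless admitting reqBytes
// would exceed the budget, in which case it rejects up front.
func (q *memQueue) Admit(reqBytes int64) (release func(), err error) {
	q.mu.Lock()
	if q.heldBytes+reqBytes > q.maxBytes {
		q.mu.Unlock()
		return nil, errBackoff // early rejection instead of unbounded queueing
	}
	q.heldBytes += reqBytes
	q.mu.Unlock()

	q.slots <- struct{}{} // wait for one of the evaluation slots
	return func() {
		<-q.slots
		q.mu.Lock()
		q.heldBytes -= reqBytes
		q.mu.Unlock()
	}, nil
}

func main() {
	q := newMemQueue(1 /* one evaluation at a time */, 64<<20 /* 64 MiB budget */)
	release, err := q.Admit(16 << 20) // e.g. a 16 MiB AddSSTable payload
	if err != nil {
		panic(err) // a real client would back off and retry here
	}
	defer release()
	// ... evaluate the request ...
}
```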

@msbutler
Collaborator

msbutler commented Feb 28, 2023

Two things to note:

  1. Rui did not see these issues in 23.1.
  2. I saw a similar sawtooth pattern of AddSSTable requests on the same index backfill on a smaller 22.2 TPC-E workload, which I documented in schema: potential 2X index backfill perf regression during tpce init #95163

@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 28, 2023
@ajwerner
Contributor

There was much internal discussion that prompted this issue; see here.

@irfansharif
Contributor

Copying some internal notes.

I’m reading this again. @dt and @rhu713, are we planning to run these 96-node TPC-E runs again? I’m four months too late to this thread and (a) the Grafana metrics are gone, and (b) lots of things have changed in 23.2: (i) replication admission control, (ii) removal of addsst concurrency limits in #104861.

> Regarding AddSSTable queueing pre-evaluation and causing OOMs, the queueing is always a risk.
Looking at the heap profile posted in #97801, with the concurrency limiter gone, we’ve somewhat reduced the likelihood of OOMs, right? There’s still the fan-in problem, but I wonder if re-running the same experiment will now hit the per-replica proposal quota pool limits of 8MiB, and reduce (but not eliminate) OOM likelihood. We’re no longer queueing on a limiter that permits 1 AddSST at a time (with no view over the memory held by other waiting requests); we’re ingesting them as quickly as AC will let us.

The client<->server protocol changes described in the messages above would still need to happen to completely eliminate server-side OOMs with a large degree of fan-in, right? I can’t really tell how real a problem it is anymore (re-running this same experiment would be the next clarifying step). If clients only issue more AddSSTs after previous ones have been processed by the server we’re all fanning into, then after the initial burst of AddSSTs (and depletion of client-side memory budgets), subsequent AddSSTs will only be issued, in aggregate, at the rate the single server is ingesting them, right? So we have flow control? The OOM concerns are then only around the initial burst (which I think should be smaller as far as the server is concerned, with the concurrency limiter gone)?
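
(To make the flow-control argument concrete: if each client caps its unacknowledged AddSSTs, then many clients fanning into one slow node collectively issue work only as fast as that node acknowledges it, after an initial burst of clients × cap requests. A minimal, hypothetical sketch of such a client-side cap follows; this is not the actual DistSender or backfiller code, and all names are made up.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sendWithCap bounds the number of unacknowledged requests a single client
// keeps in flight. With many such clients fanning into one slow server, the
// aggregate issue rate settles at the server's ingest rate once the initial
// burst of maxInFlight requests per client has been absorbed.
func sendWithCap(requests []int, maxInFlight int, send func(int)) {
	inFlight := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for _, req := range requests {
		inFlight <- struct{}{} // blocks once maxInFlight requests are unacked
		wg.Add(1)
		go func(r int) {
			defer wg.Done()
			defer func() { <-inFlight }() // "ack": free a slot only after the server responds
			send(r)
		}(req)
	}
	wg.Wait()
}

func main() {
	slowServer := func(r int) {
		time.Sleep(100 * time.Millisecond) // the one slow node everyone fans into
		fmt.Println("ingested request", r)
	}
	// Two requests in flight at a time; the rest wait for acknowledgements.
	sendWithCap([]int{1, 2, 3, 4, 5}, 2, slowServer)
}
```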

@rhu713
Contributor Author

rhu713 commented Sep 22, 2023

I don't think the DR team currently has any plans to rebuild the 96-node cluster.

@exalate-issue-sync exalate-issue-sync bot added T-admission-control Admission Control and removed T-kv KV Team labels Oct 19, 2023
@shralex
Contributor

shralex commented Oct 19, 2023

@aadityasondhi could you please see if this is still reproducible as part of the large-scale testing of replication AC? If it is, we can decide where this belongs. Thank you!

@sumeerbhola
Collaborator

Regarding #97801 (comment), we didn't have any plans for "large-scale testing of replication AC". Any problems with replication AC should be as reproducible in a small cluster as in a large one, and we have done the small-cluster experiments via roachtests.
If someone experiments with a large cluster and there are issues, we can investigate. Meanwhile, I am closing this. Feel free to reopen if I have misunderstood something.

@sumeerbhola sumeerbhola closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 20, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024