Nodes consistently OOM during index backfill on large cluster #97801
Comments
copy-paste from slack discussion:
Two things to note:
Much internal discussion prompted this issue. Copying some internal notes here.
I don't think the DR team currently has any plans to rebuild the 96-node cluster.
@aadityasondhi could you please see if this is still reproducible as part of the large-scale testing of replication AC? If it is, we can decide where this belongs. Thank you!
Regarding #97801 (comment), we didn't have any plans for "large-scale testing of replication AC". Any problems with replication AC should be as reproducible in a small cluster as in a large one, and we have done the small-cluster experiments via roachtests.
Describe the problem
While initializing the TPCE workload on a 96-node cluster with 10 million customers, nodes would consistently OOM while backfilling an index on one of the largest tables, tpce.public.trade. This is the heap profile of an example node that was using a lot of memory while creating the index:
To Reproduce
1. I created a 96-node cluster (+ 1 workload node).
2. Started the TPCE workload for 10M customers with --init.
3. Eventually, the table imports succeed and the indexes are created. When the index for tpce.public.trade starts, nodes consistently die due to OOMs, and I have to restart the cockroach process for the index job to proceed.
Expected behavior
The index backfill on tpce.public.trade should complete without nodes running out of memory.
Additional data / screenshots
debug zip: gs://rui-backup-test/debug/large-cluster/debug.zip
tsdump: gs://rui-backup-test/debug/large-cluster/tsdump.gob
Environment:
Jira issue: CRDB-24895