Reducing resource requests #42

Open · wants to merge 3 commits into base: main
Conversation

@drewoldag drewoldag commented Sep 10, 2024

Took a first pass at reducing the resources to be requested for the various workflow tasks. I feel fairly confident that these numbers are reasonable. The biggest question mark is the amount of memory to request for the kbmod_search step.

I've reduced it from 512GB to 128GB, but that still feels high. My guess is that we could get away with roughly 2.5-3x the size of the total work unit being processed. If we're maxing out an A40 for the largest work units, that's 48GB, so a 128GB request isn't out of the question. But if the majority of the work units can fit on a 2080ti with 11GB of memory, then we can significantly reduce the requested memory, perhaps to 32GB.
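To make the rule of thumb concrete, here is a quick sketch; the helper name, the 2.5x factor, and the power-of-two rounding are my own assumptions for illustration, not code from this PR:

```python
import math

# Hypothetical sizing helper: apply a ~2.5x multiplier to the work
# unit size (as discussed above) and round up to the next power-of-two
# memory request.
def host_mem_request_gb(workunit_gb, factor=2.5):
    return 2 ** math.ceil(math.log2(workunit_gb * factor))

print(host_mem_request_gb(48))  # A40-sized work unit -> 128
print(host_mem_request_gb(11))  # 2080ti-sized work unit -> 32
```

Under these assumptions an A40-sized work unit lands on the 128GB request and a 2080ti-sized one on 32GB, matching the two numbers above.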

@drewoldag drewoldag self-assigned this Sep 10, 2024
"sharded_reproject": "04:00:00",
"gpu_max": "08:00:00",
"sharded_reproject": "01:00:00",
"gpu_max": "01:00:00",
I reduced the time requested for each of these. @DinoBektesevic I think that 1hr should generally be enough to finish a search, but let me know if this should be pushed back up.

@@ -21,7 +21,7 @@ def klone_resource_config():
      os.path.join("/gscratch/dirac/kbmod/workflow/run_logs", datetime.date.today().isoformat())
  ),
  run_dir=os.path.join("/gscratch/dirac/kbmod/workflow/run_logs", datetime.date.today().isoformat()),
- retries=1,
+ retries=100,
Until we have a good way to catch and ignore pre-emption "failures" that would increment the retry counter, we can naively set the max retry number to something large.
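For reference, the knob being turned here lives on Parsl's top-level `Config`. A sketch of how the pre-emption filtering could eventually work; the predicate below is hypothetical, and `retry_handler` is only available in newer Parsl releases:

```python
from parsl.config import Config

def free_preemption_retries(exc, task_record):
    # Hypothetical predicate: charge pre-emptions a retry cost of 0 so
    # they don't consume the retry budget; real failures cost 1. The
    # real check would inspect the provider error for a pre-emption
    # signature.
    return 0 if "Preempted" in str(exc) else 1

config = Config(
    retries=100,  # naive ceiling until pre-emptions can be filtered out
    retry_handler=free_preemption_retries,
)
```

With a handler like this in place, `retries` could come back down to a small number that only guards against genuine task failures.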

@@ -35,14 +35,15 @@ def klone_resource_config():
  parallelism=1,
  nodes_per_block=1,
  cores_per_node=1,  # perhaps should be 8???
- mem_per_node=256,  # In GB
+ mem_per_node=32,  # In GB
This executor is only used by the pre-TNO workflow to convert the URI file into an ImageCollection. So we probably never needed anywhere near the memory that was requested.

- cores_per_node=32,
- mem_per_node=128,  # ~2-4 GB per core
+ cores_per_node=8,
+ mem_per_node=32,  # ~2-4 GB per core
In this executor we're cranking up the maximum number of concurrent jobs running and decreasing the cores per node and memory.

- cores_per_node=2,  # perhaps should be 8???
- mem_per_node=512,  # In GB
+ cores_per_node=1,
+ mem_per_node=128,  # In GB
Similarly here, we're reducing the number of cores and memory for the kbmod search step.
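Put together, the reduced executor shape looks roughly like this. This is a sketch only: the label, partition name, and `max_blocks` value are hypothetical and not taken from this diff:

```python
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

# Sketch of the reduced-footprint search executor. The label,
# partition, and max_blocks values are hypothetical placeholders.
executor = HighThroughputExecutor(
    label="kbmod_search",        # hypothetical label
    provider=SlurmProvider(
        partition="ckpt",        # hypothetical partition
        nodes_per_block=1,
        cores_per_node=1,        # down from 2
        mem_per_node=128,        # in GB, down from 512
        max_blocks=10,           # hypothetical; raised for concurrency
    ),
)
```

The trade being made across these hunks is per-node size for concurrency: smaller `cores_per_node`/`mem_per_node` requests should schedule faster on a busy cluster, while more concurrent blocks keep throughput up.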
