Reducing resource requests #42
base: main
Conversation
"sharded_reproject": "04:00:00", | ||
"gpu_max": "08:00:00", | ||
"sharded_reproject": "01:00:00", | ||
"gpu_max": "01:00:00", |
I reduced the time requested for each of these. @DinoBektesevic I think that 1hr should generally be enough to finish a search, but let me know if this should be pushed back up.
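For context, these strings are just Slurm time limits, so the change amounts to swapping the value that ends up on the provider. A minimal sketch, assuming the walltimes feed a Parsl SlurmProvider (the dict name and partition below are placeholders, not the actual kbmod config):

from parsl.providers import SlurmProvider

# Hypothetical per-task walltime map mirroring the new values in this diff.
walltimes = {
    "sharded_reproject": "01:00:00",  # was 04:00:00
    "gpu_max": "01:00:00",            # was 08:00:00
}

# Slurm kills the job once this limit is hit, so guessing low costs a re-run,
# while guessing high mostly hurts queue priority.
gpu_provider = SlurmProvider(
    partition="gpu-a40",              # placeholder partition name
    walltime=walltimes["gpu_max"],
)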
@@ -21,7 +21,7 @@ def klone_resource_config():
         os.path.join("/gscratch/dirac/kbmod/workflow/run_logs", datetime.date.today().isoformat())
     ),
     run_dir=os.path.join("/gscratch/dirac/kbmod/workflow/run_logs", datetime.date.today().isoformat()),
-    retries=1,
+    retries=100,
Until we have a good way to catch and ignore pre-emption "failures" that would increment the retry counter, we can naively set the max retry number to something large.
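One possible alternative to the blunt retries=100 would be Parsl's retry_handler hook, which lets a failure be charged a fractional (or zero) retry cost. A sketch only, assuming the pre-emption surfaces as an exception whose message we can inspect; the "PREEMPT" string check is a guess, not existing kbmod code:

from parsl.config import Config

def preemption_aware_retries(exception, task_record):
    # Don't charge pre-empted tasks against the retry budget; only real
    # failures consume a retry. The "PREEMPT" check is a placeholder for
    # however the pre-emption actually shows up in the exception.
    if "PREEMPT" in str(exception):
        return 0
    return 1

config = Config(retries=3, retry_handler=preemption_aware_retries)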
@@ -35,14 +35,15 @@ def klone_resource_config():
     parallelism=1,
     nodes_per_block=1,
     cores_per_node=1,  # perhaps should be 8???
-    mem_per_node=256,  # In GB
+    mem_per_node=32,  # In GB
This executor is only used by the pre-TNO workflow to convert the URI file into an ImageCollection. So we probably never needed anywhere near the memory that was requested.
-    cores_per_node=32,
-    mem_per_node=128,  # ~2-4 GB per core
+    cores_per_node=8,
+    mem_per_node=32,  # ~2-4 GB per core
In this executor we're increasing the maximum number of concurrent jobs while decreasing the cores and memory per node.
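In Parsl terms the trade-off looks roughly like this: smaller per-block core and memory requests combined with a higher max_blocks, so more blocks can be scheduled at once. Illustrative sketch only; the label and partition are placeholders, not the actual executor definition:

from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

executor = HighThroughputExecutor(
    label="small_cpu_executor",      # placeholder label
    provider=SlurmProvider(
        partition="ckpt",            # placeholder partition
        nodes_per_block=1,
        cores_per_node=8,            # down from 32
        mem_per_node=32,             # down from 128 GB; keeps ~2-4 GB per core
        max_blocks=16,               # allow more of these smaller blocks at once
        walltime="01:00:00",
    ),
)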
-    cores_per_node=2,  # perhaps should be 8???
-    mem_per_node=512,  # In GB
+    cores_per_node=1,
+    mem_per_node=128,  # In GB
Similarly here, we're reducing the number of cores and memory for the kbmod search step.
Took a first pass at reducing the resources to be requested for the various workflow tasks. I feel fairly confident that these numbers are reasonable. The biggest question mark is the amount of memory to request for the kbmod_search step.
I've reduced it from 512GB to 128GB, but that still feels generally high. My guess is that we could probably get away with something like 2.5-3x the size of the total work unit being processed. If we're maxing out an A40's 48GB for the largest work units, 2.5-3x of that is roughly 120-144GB, so perhaps 128GB of memory isn't out of the question. But if the majority of the work units can fit on a 2080ti with 11GB of memory, then we can significantly reduce the requested memory, perhaps to 32GB.
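For what it's worth, the 2.5-3x rule of thumb works out like this (illustrative arithmetic only, not workflow code):

# 2.5-3x the work-unit size, where the work unit is bounded by GPU memory.
for gpu, vram_gb in [("A40", 48), ("2080 Ti", 11)]:
    low, high = 2.5 * vram_gb, 3 * vram_gb
    print(f"{gpu}: work unit <= {vram_gb} GB -> request ~{low:.0f}-{high:.0f} GB of host memory")
# A40:     ~120-144 GB, so the 128 GB request is in range
# 2080 Ti: ~28-33 GB, so ~32 GB would cover it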