Reducing resource requests #42
base: main
Conversation
"sharded_reproject": "04:00:00", | ||
"gpu_max": "08:00:00", | ||
"sharded_reproject": "01:00:00", | ||
"gpu_max": "01:00:00", |
I reduced the time requested for each of these. @DinoBektesevic I think that 1hr should generally be enough to finish a search, but let me know if this should be pushed back up.
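For context, these strings are just Slurm time limits, so the change amounts to swapping the value that ends up on the provider. A minimal sketch, assuming the walltimes feed a Parsl SlurmProvider (the dict name and partition below are placeholders, not the actual kbmod config):

from parsl.providers import SlurmProvider

# Hypothetical per-task walltime map mirroring the new values in this diff.
walltimes = {
    "sharded_reproject": "01:00:00",  # was 04:00:00
    "gpu_max": "01:00:00",            # was 08:00:00
}

# Slurm kills the job once this limit is hit, so guessing low costs a re-run,
# while guessing high mostly hurts queue priority.
gpu_provider = SlurmProvider(
    partition="gpu-a40",              # placeholder partition name
    walltime=walltimes["gpu_max"],
)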
@@ -21,7 +21,7 @@ def klone_resource_config():
         os.path.join("/gscratch/dirac/kbmod/workflow/run_logs", datetime.date.today().isoformat())
     ),
     run_dir=os.path.join("/gscratch/dirac/kbmod/workflow/run_logs", datetime.date.today().isoformat()),
-    retries=1,
+    retries=100,
Until we have a good way to catch and ignore pre-emption "failures" that would increment the retry counter, we can naively set the max retry number to something large.
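One possible alternative to the blunt retries=100 would be Parsl's retry_handler hook, which lets a failure be charged a fractional (or zero) retry cost. A sketch only, assuming the pre-emption surfaces as an exception whose message we can inspect; the "PREEMPT" string check is a guess, not existing kbmod code:

from parsl.config import Config

def preemption_aware_retries(exception, task_record):
    # Don't charge pre-empted tasks against the retry budget; only real
    # failures consume a retry. The "PREEMPT" check is a placeholder for
    # however the pre-emption actually shows up in the exception.
    if "PREEMPT" in str(exception):
        return 0
    return 1

config = Config(retries=3, retry_handler=preemption_aware_retries)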
@@ -35,14 +35,15 @@ def klone_resource_config():
     parallelism=1,
     nodes_per_block=1,
     cores_per_node=1,  # perhaps should be 8???
-    mem_per_node=256,  # In GB
+    mem_per_node=32,  # In GB
This executor is only used by the pre-TNO workflow to convert the URI file into an ImageCollection. So we probably never needed anywhere near the memory that was requested.
-    cores_per_node=32,
-    mem_per_node=128,  # ~2-4 GB per core
+    cores_per_node=8,
+    mem_per_node=32,  # ~2-4 GB per core
In this executor we're increasing the maximum number of concurrent jobs while decreasing the cores and memory per node.
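In Parsl terms the trade-off looks roughly like this: smaller per-block core and memory requests combined with a higher max_blocks, so more blocks can be scheduled at once. Illustrative sketch only; the label and partition are placeholders, not the actual executor definition:

from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

executor = HighThroughputExecutor(
    label="small_cpu_executor",      # placeholder label
    provider=SlurmProvider(
        partition="ckpt",            # placeholder partition
        nodes_per_block=1,
        cores_per_node=8,            # down from 32
        mem_per_node=32,             # down from 128 GB; keeps ~2-4 GB per core
        max_blocks=16,               # allow more of these smaller blocks at once
        walltime="01:00:00",
    ),
)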
-    cores_per_node=2,  # perhaps should be 8???
-    mem_per_node=512,  # In GB
+    cores_per_node=1,
+    mem_per_node=128,  # In GB
Similarly here, we're reducing the number of cores and memory for the kbmod search step.
Took a first pass at reducing the resources to be requested for the various workflow tasks. I feel fairly confident that these numbers are reasonable. The biggest question mark is the amount of memory to request for the kbmod_search step.
I've reduced it from 512GB to 128GB, but that still feels generally high. My guess is that we could probably get away with something like 2.5-3x the size of the total work unit being processed. If we're maxing out an A40's 48GB for the largest work units, 2.5-3x of that is roughly 120-144GB, so perhaps 128GB of memory isn't out of the question. But if the majority of the work units can fit on a 2080ti with 11GB of memory, then we can significantly reduce the requested memory, perhaps to 32GB.
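For what it's worth, the 2.5-3x rule of thumb works out like this (illustrative arithmetic only, not workflow code):

# 2.5-3x the work-unit size, where the work unit is bounded by GPU memory.
for gpu, vram_gb in [("A40", 48), ("2080 Ti", 11)]:
    low, high = 2.5 * vram_gb, 3 * vram_gb
    print(f"{gpu}: work unit <= {vram_gb} GB -> request ~{low:.0f}-{high:.0f} GB of host memory")
# A40:     ~120-144 GB, so the 128 GB request is in range
# 2080 Ti: ~28-33 GB, so ~32 GB would cover it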