
implement partial release of resources #1151

Closed
garlick opened this issue Mar 7, 2024 · 5 comments

garlick (Member) commented Mar 7, 2024

Problem: as discussed in flux-framework/flux-core#4312, the original plan for partial release of resources was to give the scheduler a free RPC for each R fragment of a job's resources that can be returned to the pool. In Fluxion, the R is ignored in the free callback and the jobid is used instead to free all resources allocated to the job.

An additional problem is that flux-core cannot fragment the contents of the opaque scheduling key in R.

Assuming we figure out a way in flux-core to release resources in parts, how can this be made to work in Fluxion?

Note that RFC 27 would need to be updated as it currently describes a single free RPC.

garlick (Member, Author) commented Mar 7, 2024

A fragment would contain all the resources allocated to the job on one or more execution targets (broker ranks). That is, it would not be further subdivided.

One thought on the JGF problem is that perhaps the combination of the job ID and the list of execution target IDs from Rv1 would be sufficient to identify the resources being freed.
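To make the idea concrete: an Rv1 object carries its execution targets as idset strings in `execution.R_lite`, so a (job ID, rank set) pair can be derived from any fragment. A minimal Python sketch, assuming an Rv1-shaped dict per RFC 20 and an RFC 22 style idset; the helper names are illustrative, not flux APIs:

```python
def expand_idset(s):
    """Expand an RFC 22 style idset string like "0-3,7" into a set of ints."""
    ranks = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ranks.update(range(int(lo), int(hi) + 1))
        else:
            ranks.add(int(part))
    return ranks


def fragment_key(jobid, r_fragment):
    """Identify a freed fragment by (jobid, execution target ranks) from an Rv1 object."""
    ranks = set()
    for entry in r_fragment["execution"]["R_lite"]:
        ranks |= expand_idset(entry["rank"])
    return jobid, frozenset(ranks)


# Hypothetical Rv1 fragment covering broker ranks 2-3 of a larger allocation
frag = {"version": 1, "execution": {"R_lite": [{"rank": "2-3", "children": {"core": "0-7"}}]}}
```

Since a fragment is never subdivided below an execution target, this key is unambiguous even without the opaque `R.scheduling` contents.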

garlick added a commit to garlick/flux-core that referenced this issue Mar 11, 2024
Problem: R has to be looked up from the KVS in the sched.free
request handler, but now that the job manager caches R, this
is an unnecessary extra step.

Add R to the sched.free request payload.

Note that the `R.scheduling` key is not included.  The current design of
Fluxion in which `R.scheduling` may contain a voluminous JGF object made
caching this part of R impractical.

Change libschedutil so that
- the sched.free message handler never looks up R in the kvs
- the free callback always sets its `R` argument to NULL
- the SCHEDUTIL_FREE_NOLOOKUP flag is a no-op

Update sched-simple's free callback to unpack R from the message
instead of decoding the `R` argument.

Note that Fluxion sets SCHEDUTIL_FREE_NOLOOKUP so it already expects
the free callback's R argument to be NULL.  Although this change increases
the size of sched.free payloads with data that Fluxion currently does not
use, the ranks in R will be required by Fluxion in the future to identify
resource subsets for partial release (flux-framework/flux-sched#1151).

This change should be accompanied by an update to RFC 27.

Update sched-simple unit test.

Fixes flux-framework#5775
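The flow this commit describes, with R unpacked from the sched.free payload instead of looked up in the KVS, can be sketched as follows. This is a plain-dict stand-in: the real code is C in libschedutil, and the handler and payload field names here are illustrative only:

```python
def handle_sched_free(msg, free_cb):
    """Dispatch a sched.free request; R now rides in the payload, so no KVS lookup."""
    payload = msg["payload"]
    free_cb(payload["id"], payload["R"])  # R.scheduling is deliberately absent


freed = []
msg = {
    "topic": "sched.free",
    "payload": {
        "id": 42,
        "R": {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-3"}}]}},
    },
}
handle_sched_free(msg, lambda jobid, R: freed.append((jobid, R["execution"]["R_lite"][0]["rank"])))
```

The ranks in the payload's R are the part Fluxion would later use to identify resource subsets for partial release.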
garlick added a commit to garlick/flux-core that referenced this issue Mar 11, 2024
garlick added a commit to garlick/flux-core that referenced this issue Mar 21, 2024
Problem: jobs get stuck in CLEANUP state while long epilog
scripts run, causing sadness and idling resources.

Introduce a new type of epilog script called "housekeeping" that is
ostensibly job independent.  Instead of freeing resources directly
to the scheduler, jobs free resources to housekeeping, post their free
event, and may reach INACTIVE state.  Meanwhile, housekeeping can run
a script on the allocated resources and return the resources to the
scheduler when complete. The resources are still allocated to the job
as far as the scheduler is concerned while housekeeping runs.  However
since the job has transitioned to INACTIVE, the flux-accounting plugin
will decrement the running job count for the user and stop billing
the user for the resources.  The 'flux resource list' utility shows the
resources as allocated.

By default, resources are released all at once to the scheduler, as before.
However, if configured, resources can be freed to the scheduler immediately
as they complete housekeeping on each execution target, or a timer can be
started on completion of the first target, and when the timer expires, all
the targets that have completed thus far are freed in one go. Following that,
resources are freed to the scheduler immediately as they complete.

This works with sched-simple without changes, except that the hello
protocol does not currently support partial release, so, as noted in
the code, housekeeping and a new job could overlap when the scheduler is
reloaded on a live system.  Some RFC 27 work is needed to resolve this.

The Fluxion scheduler does not currently support partial release
(flux-framework/flux-sched#1151).  But as discussed over there, the
combination of receiving an R fragment and a jobid in the free request
should be sufficient to get that working.
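The three release policies described above (all at once, immediate per target, timer-batched) can be sketched with a small state machine. This is a simplified stand-in, not flux-core's implementation; the class and method names are invented, and the timer is simulated by an explicit call:

```python
class HousekeepingBatch:
    """Sketch of the housekeeping release policies (illustrative names).

    "all" releases everything once every target is done; "immediate" frees
    each target as it finishes; "timer" batches targets finishing before a
    (simulated) timer expiry into one release, then frees the rest one by one.
    """

    def __init__(self, targets, policy):
        self.pending = set(targets)
        self.policy = policy
        self.batch = []
        self.timer_expired = False
        self.released = []  # each entry is one free request (a list of ranks)

    def target_done(self, rank):
        self.pending.discard(rank)
        if self.policy == "immediate" or (self.policy == "timer" and self.timer_expired):
            self.released.append([rank])  # freed to the scheduler right away
        else:
            self.batch.append(rank)       # held until timer expiry / all done
            if self.policy == "all" and not self.pending:
                self.released.append(self.batch)
                self.batch = []

    def timer_fires(self):
        """Timer started at the first completion expires: free the batch in one go."""
        self.timer_expired = True
        if self.batch:
            self.released.append(self.batch)
            self.batch = []


hk = HousekeepingBatch({0, 1, 2, 3}, "timer")
hk.target_done(1)
hk.target_done(3)  # both finish before the timer expires
hk.timer_fires()   # ranks 1 and 3 freed in one batched request
hk.target_done(0)  # remaining targets are now freed immediately
hk.target_done(2)
```

With the "timer" policy above, the scheduler would see three free requests: one for ranks 1 and 3 together, then one each for ranks 0 and 2.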
garlick added a commit to garlick/flux-core that referenced this issue Mar 21, 2024
garlick added a commit to garlick/flux-core that referenced this issue Mar 21, 2024
garlick added a commit to garlick/flux-core that referenced this issue Mar 25, 2024
@milroy milroy self-assigned this Mar 28, 2024
milroy (Member) commented Mar 28, 2024

Assuming a fragment covers a subset of the job's broker ranks but contains the full R for each of those ranks, adding this support should be straightforward.

Mainly what's needed is to identify the broker ranks in the R fragment and iterate through the vertices in the by_rank graph metadata map for each rank. Then remove the scheduling and planner data per vertex. Updating the vertices' ancestors' pruning filters will require some thought, though...
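The rank-to-vertex iteration described above can be sketched with dicts standing in for Fluxion's graph metadata. The real `by_rank` map lives in the C++ resource graph and the planner/pruning-filter updates are omitted here; all names are illustrative:

```python
def partial_cancel(by_rank, allocations, jobid, freed_ranks):
    """Release jobid's hold on the vertices belonging to freed_ranks only.

    by_rank maps broker rank -> vertex names (a dict stand-in for Fluxion's
    by_rank graph metadata); allocations maps vertex name -> set of jobids.
    Returns True once the job holds no vertex at all (full release).
    """
    for rank in freed_ranks:
        for vertex in by_rank.get(rank, ()):
            allocations.get(vertex, set()).discard(jobid)
    return not any(jobid in holders for holders in allocations.values())


# Hypothetical two-rank allocation held by job 7
by_rank = {0: ["node0", "core0-0"], 1: ["node1", "core1-0"]}
allocations = {v: {7} for vs in by_rank.values() for v in vs}
```

Freeing rank 0 first leaves the job holding rank 1's vertices; freeing rank 1 afterwards completes the release, at which point the job's remaining bookkeeping could be torn down.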

garlick added a commit to garlick/flux-core that referenced this issue Apr 3, 2024
milroy added a commit to milroy/flux-sched that referenced this issue Apr 8, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
need is to release all resources managed by a single broker rank.
In the future, support for releasing arbitrary subgraphs will be
needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
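The shape of this change, threading a modification type and a type-to-count map through the recursive removal, might look like the following outline. This is a dict-based stand-in for the traverser, not the actual rem_* code; vertex layout and names are illustrative:

```python
from collections import Counter

PARTIAL_CANCEL, FULL_CANCEL = "partial-cancel", "full-cancel"


def rem_subtree(vertex, jobid, mod_type, freed_ranks, type_to_count):
    """Recursively drop jobid's allocation below vertex.

    Under PARTIAL_CANCEL only vertices on freed_ranks are touched, and
    type_to_count tallies how many resources of each type were released,
    which is what the corresponding planner updates would need.
    """
    if mod_type == FULL_CANCEL or vertex["rank"] in freed_ranks:
        if jobid in vertex["allocs"]:
            vertex["allocs"].discard(jobid)
            type_to_count[vertex["type"]] += 1
    for child in vertex["children"]:
        rem_subtree(child, jobid, mod_type, freed_ranks, type_to_count)


def node(rank):
    cores = [{"type": "core", "rank": rank, "allocs": {9}, "children": []} for _ in range(2)]
    return {"type": "node", "rank": rank, "allocs": {9}, "children": cores}


root = {"type": "cluster", "rank": None, "allocs": {9}, "children": [node(0), node(1)]}
counts = Counter()
rem_subtree(root, 9, PARTIAL_CANCEL, {0}, counts)
```

After a partial cancel of rank 0, counts holds one node and two cores, while the cluster vertex and rank 1's subtree still carry the job's allocation, matching the "full R per freed broker rank" assumption discussed above.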
milroy added a commit to milroy/flux-sched that referenced this issue Apr 8, 2024
garlick added a commit to garlick/flux-core that referenced this issue Apr 18, 2024
garlick added a commit to garlick/flux-core that referenced this issue May 8, 2024
milroy added a commit to milroy/flux-sched that referenced this issue May 21, 2024
milroy added a commit to milroy/flux-sched that referenced this issue May 22, 2024
milroy added a commit to milroy/flux-sched that referenced this issue May 22, 2024
garlick added a commit to garlick/flux-core that referenced this issue Jun 6, 2024
milroy added a commit to milroy/flux-sched that referenced this issue Jun 16, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 16, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 16, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 17, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 17, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 17, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 17, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 28, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 28, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 28, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
grondo pushed a commit to garlick/flux-core that referenced this issue Jun 28, 2024
Problem: jobs get stuck in CLEANUP state while long epilog
scripts run, causing sadness and idling resources.

Introduce a new type of epilog script called "housekeeping" that runs
after the job.  Instead of freeing resources directly to the scheduler,
jobs free resources to housekeeping, post their free event, and may reach
INACTIVE state.  Meanwhile, housekeeping can run a script on the allocated
resources and return the resources to the scheduler when complete.  The
resources are still allocated to the job as far as the scheduler is
concerned while housekeeping runs.  However since the job has transitioned
to INACTIVE, the flux-accounting plugin will decrement the running job
count for the user and stop billing the user for the resources.
'flux resource list' utility shows the resources as allocated.

By default, resources are released to the scheduler only after all ranks
complete housekeeping, as before.  However, if configured, resources can
be freed to the scheduler immediately as they complete housekeeping on
each execution target, or a timer can be started on completion of the
first target, and when the timer expires, all the targets that have
completed thus far are freed in one go. Following that, resources are
freed to the scheduler immediately as they complete.

This works with sched-simple without changes, with the exception that the
hello protocol does not currently support partial release so, as noted in
the code, housekeeping and a new job could overlap when the scheduler is
reloaded on a live system.  Some RFC 27 work is needed to resolve this.

The Fluxion scheduler does not currently support partial release
(flux-framework/flux-sched#1151).  But as discussed over there, the
combination of receiving an R fragment and a jobid in the free request
should be sufficient to get that working.
@jameshcorbett
Copy link
Member

With the assumption that a fragment contains a subset of the job's broker ranks but the entire R for each of those ranks (i.e., the full per-rank R), adding this support should be straightforward.

Mainly what's needed is to identify the broker ranks in the R fragment and iterate through the vertices in the by_rank graph metadata map for each rank.

My understanding is that on elcap systems, the scheduler will need to be initialized from JGF in order to understand the rabbit layout. It will also need to emit JGF for jobs in order to facilitate scheduler restart. The partial release will come in the form of R, but that's OK because of this simplifying assumption, right?

milroy added a commit to milroy/flux-sched that referenced this issue Jul 6, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.

Switch cancellation behavior based on the job_modify_t enum class.
@milroy
Copy link
Member

milroy commented Jul 10, 2024

The partial release will come in the form of R but that's OK because of this simplifying assumption right?

That's correct. The partial cancel/release just uses the Rlite fragment string contained in the free RPC payload.

adding this support should be straightforward.

Famous last words. Fortunately the PR is merged and the functionality is in Fluxion now.

@trws
Copy link
Member

trws commented Jul 31, 2024

@milroy, it looks like this one can be closed, so I'm closing it. If there's something we need to keep open here feel free to re-open.

@trws trws closed this as completed Jul 31, 2024