
implement partial release of resources #1151

Closed
garlick opened this issue Mar 7, 2024 · 5 comments

garlick (Member) commented Mar 7, 2024

Problem: as discussed in flux-framework/flux-core#4312, the original plan for partial release of resources was to give the scheduler a free RPC for each R fragment of a job's resources that can be returned to the pool. In Fluxion, the R is ignored in the free callback and the jobid is used instead to free all resources allocated to the job.

An additional problem is that flux-core cannot fragment the contents of the opaque scheduling key in R.

Assuming we figure out a way in flux-core to release resources in parts, how can this be made to work in Fluxion?

Note that RFC 27 would need to be updated as it currently describes a single free RPC.

garlick (Member, Author) commented Mar 7, 2024

A fragment would contain all the resources allocated to the job on one or more execution targets (broker ranks). That is, it would not be further subdivided.

One thought on the JGF problem is that perhaps the combination of the job ID and the list of execution target IDs from Rv1 would be sufficient to identify the resources being freed.
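To make the idea concrete: an Rv1 object carries its execution targets as idset strings in `execution.R_lite`, so a (job ID, rank set) pair can be derived from any fragment. A minimal Python sketch, assuming an Rv1-shaped dict per RFC 20 and an RFC 22 style idset; the helper names are illustrative, not flux APIs:

```python
def expand_idset(s):
    """Expand an RFC 22 style idset string like "0-3,7" into a set of ints."""
    ranks = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ranks.update(range(int(lo), int(hi) + 1))
        else:
            ranks.add(int(part))
    return ranks


def fragment_key(jobid, r_fragment):
    """Identify a freed fragment by (jobid, execution target ranks) from an Rv1 object."""
    ranks = set()
    for entry in r_fragment["execution"]["R_lite"]:
        ranks |= expand_idset(entry["rank"])
    return jobid, frozenset(ranks)


# Hypothetical Rv1 fragment covering broker ranks 2-3 of a larger allocation
frag = {"version": 1, "execution": {"R_lite": [{"rank": "2-3", "children": {"core": "0-7"}}]}}
```

Since a fragment is never subdivided below an execution target, this key is unambiguous even without the opaque `R.scheduling` contents.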

garlick added a commit to garlick/flux-core that referenced this issue Mar 11, 2024
Problem: R has to be looked up from the KVS in the sched.free
request handler, but now that the job manager caches R, this
is an unnecessary extra step.

Add R to the sched.free request payload.

Note that the `R.scheduling` key is not included.  The current design of
Fluxion in which `R.scheduling` may contain a voluminous JGF object made
caching this part of R impractical.

Change libschedutil so that
- the sched.free message handler never looks up R in the kvs
- the free callback always sets its `R` argument to NULL
- the SCHEDUTIL_FREE_NOLOOKUP flag is a no-op

Update sched-simple's free callback to unpack R from the message
instead of decoding the `R` argument.

Note that Fluxion sets SCHEDUTIL_FREE_NOLOOKUP so it already expects
the free callback's R argument to be NULL.  Although this change increases
the size of sched.free payloads with data that Fluxion currently does not
use, the ranks in R will be required by Fluxion in the future to identify
resource subsets for partial release (flux-framework/flux-sched#1151).

This change should be accompanied by an update to RFC 27.

Update sched-simple unit test.

Fixes flux-framework#5775
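The flow this commit describes, with R unpacked from the sched.free payload instead of looked up in the KVS, can be sketched as follows. This is a plain-dict stand-in: the real code is C in libschedutil, and the handler and payload field names here are illustrative only:

```python
def handle_sched_free(msg, free_cb):
    """Dispatch a sched.free request; R now rides in the payload, so no KVS lookup."""
    payload = msg["payload"]
    free_cb(payload["id"], payload["R"])  # R.scheduling is deliberately absent


freed = []
msg = {
    "topic": "sched.free",
    "payload": {
        "id": 42,
        "R": {"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-3"}}]}},
    },
}
handle_sched_free(msg, lambda jobid, R: freed.append((jobid, R["execution"]["R_lite"][0]["rank"])))
```

The ranks in the payload's R are the part Fluxion would later use to identify resource subsets for partial release.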
garlick added a commit to garlick/flux-core that referenced this issue Mar 11, 2024
garlick added a commit to garlick/flux-core that referenced this issue Mar 21, 2024
Problem: jobs get stuck in CLEANUP state while long epilog
scripts run, causing sadness and idling resources.

Introduce a new type of epilog script called "housekeeping" that is
ostensibly job independent.  Instead of freeing resources directly
to the scheduler, jobs free resources to housekeeping, post their free
event, and may reach INACTIVE state.  Meanwhile, housekeeping can run
a script on the allocated resources and return the resources to the
scheduler when complete. The resources are still allocated to the job
as far as the scheduler is concerned while housekeeping runs.  However
since the job has transitioned to INACTIVE, the flux-accounting plugin
will decrement the running job count for the user and stop billing
the user for the resources.  The 'flux resource list' utility shows the
resources as allocated.

By default, resources are released all at once to the scheduler, as before.
However, if configured, resources can be freed to the scheduler immediately
as they complete housekeeping on each execution target, or a timer can be
started on completion of the first target, and when the timer expires, all
the targets that have completed thus far are freed in one go. Following that,
resources are freed to the scheduler immediately as they complete.

This works with sched-simple without changes, except that the hello
protocol does not currently support partial release, so, as noted in
the code, housekeeping and a new job could overlap when the scheduler is
reloaded on a live system.  Some RFC 27 work is needed to resolve this.

The Fluxion scheduler does not currently support partial release
(flux-framework/flux-sched#1151).  But as discussed over there, the
combination of receiving an R fragment and a jobid in the free request
should be sufficient to get that working.
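The three release policies described above (all at once, immediate per target, timer-batched) can be sketched with a small state machine. This is a simplified stand-in, not flux-core's implementation; the class and method names are invented, and the timer is simulated by an explicit call:

```python
class HousekeepingBatch:
    """Sketch of the housekeeping release policies (illustrative names).

    "all" releases everything once every target is done; "immediate" frees
    each target as it finishes; "timer" batches targets finishing before a
    (simulated) timer expiry into one release, then frees the rest one by one.
    """

    def __init__(self, targets, policy):
        self.pending = set(targets)
        self.policy = policy
        self.batch = []
        self.timer_expired = False
        self.released = []  # each entry is one free request (a list of ranks)

    def target_done(self, rank):
        self.pending.discard(rank)
        if self.policy == "immediate" or (self.policy == "timer" and self.timer_expired):
            self.released.append([rank])  # freed to the scheduler right away
        else:
            self.batch.append(rank)       # held until timer expiry / all done
            if self.policy == "all" and not self.pending:
                self.released.append(self.batch)
                self.batch = []

    def timer_fires(self):
        """Timer started at the first completion expires: free the batch in one go."""
        self.timer_expired = True
        if self.batch:
            self.released.append(self.batch)
            self.batch = []


hk = HousekeepingBatch({0, 1, 2, 3}, "timer")
hk.target_done(1)
hk.target_done(3)  # both finish before the timer expires
hk.timer_fires()   # ranks 1 and 3 freed in one batched request
hk.target_done(0)  # remaining targets are now freed immediately
hk.target_done(2)
```

With the "timer" policy above, the scheduler would see three free requests: one for ranks 1 and 3 together, then one each for ranks 0 and 2.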
garlick added a commit to garlick/flux-core that referenced this issue Mar 21, 2024
garlick added a commit to garlick/flux-core that referenced this issue Mar 21, 2024
garlick added a commit to garlick/flux-core that referenced this issue Mar 25, 2024
@milroy milroy self-assigned this Mar 28, 2024
milroy (Member) commented Mar 28, 2024

Assuming a fragment covers a subset of the job's broker ranks but contains the full R for each of those ranks, adding this support should be straightforward.

Mainly what's needed is to identify the broker ranks in the R fragment and iterate through the vertices in the by_rank graph metadata map for each rank. Then remove the scheduling and planner data per vertex. Updating the vertices' ancestors' pruning filters will require some thought, though...
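The rank-to-vertex iteration described above can be sketched with dicts standing in for Fluxion's graph metadata. The real `by_rank` map lives in the C++ resource graph and the planner/pruning-filter updates are omitted here; all names are illustrative:

```python
def partial_cancel(by_rank, allocations, jobid, freed_ranks):
    """Release jobid's hold on the vertices belonging to freed_ranks only.

    by_rank maps broker rank -> vertex names (a dict stand-in for Fluxion's
    by_rank graph metadata); allocations maps vertex name -> set of jobids.
    Returns True once the job holds no vertex at all (full release).
    """
    for rank in freed_ranks:
        for vertex in by_rank.get(rank, ()):
            allocations.get(vertex, set()).discard(jobid)
    return not any(jobid in holders for holders in allocations.values())


# Hypothetical two-rank allocation held by job 7
by_rank = {0: ["node0", "core0-0"], 1: ["node1", "core1-0"]}
allocations = {v: {7} for vs in by_rank.values() for v in vs}
```

Freeing rank 0 first leaves the job holding rank 1's vertices; freeing rank 1 afterwards completes the release, at which point the job's remaining bookkeeping could be torn down.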

garlick added a commit to garlick/flux-core that referenced this issue Apr 3, 2024
milroy added a commit to milroy/flux-sched that referenced this issue Apr 8, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
need is to release all resources managed by a single broker rank.
In the future, support for releasing arbitrary subgraphs will be
needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
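The shape of this change, threading a modification type and a type-to-count map through the recursive removal, might look like the following outline. This is a dict-based stand-in for the traverser, not the actual rem_* code; vertex layout and names are illustrative:

```python
from collections import Counter

PARTIAL_CANCEL, FULL_CANCEL = "partial-cancel", "full-cancel"


def rem_subtree(vertex, jobid, mod_type, freed_ranks, type_to_count):
    """Recursively drop jobid's allocation below vertex.

    Under PARTIAL_CANCEL only vertices on freed_ranks are touched, and
    type_to_count tallies how many resources of each type were released,
    which is what the corresponding planner updates would need.
    """
    if mod_type == FULL_CANCEL or vertex["rank"] in freed_ranks:
        if jobid in vertex["allocs"]:
            vertex["allocs"].discard(jobid)
            type_to_count[vertex["type"]] += 1
    for child in vertex["children"]:
        rem_subtree(child, jobid, mod_type, freed_ranks, type_to_count)


def node(rank):
    cores = [{"type": "core", "rank": rank, "allocs": {9}, "children": []} for _ in range(2)]
    return {"type": "node", "rank": rank, "allocs": {9}, "children": cores}


root = {"type": "cluster", "rank": None, "allocs": {9}, "children": [node(0), node(1)]}
counts = Counter()
rem_subtree(root, 9, PARTIAL_CANCEL, {0}, counts)
```

After a partial cancel of rank 0, counts holds one node and two cores, while the cluster vertex and rank 1's subtree still carry the job's allocation, matching the "full R per freed broker rank" assumption discussed above.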
milroy added a commit to milroy/flux-sched that referenced this issue Apr 8, 2024
garlick added a commit to garlick/flux-core that referenced this issue Apr 18, 2024
garlick added a commit to garlick/flux-core that referenced this issue May 8, 2024
milroy added a commit to milroy/flux-sched that referenced this issue May 21, 2024
milroy added a commit to milroy/flux-sched that referenced this issue May 22, 2024
milroy added a commit to milroy/flux-sched that referenced this issue May 22, 2024
garlick added a commit to garlick/flux-core that referenced this issue Jun 6, 2024
milroy added a commit to milroy/flux-sched that referenced this issue Jun 16, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 16, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 16, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 17, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 17, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 17, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 17, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 28, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 28, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
milroy added a commit to milroy/flux-sched that referenced this issue Jun 28, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.
grondo pushed a commit to garlick/flux-core that referenced this issue Jun 28, 2024
Problem: jobs get stuck in CLEANUP state while long epilog
scripts run, causing sadness and idling resources.

Introduce a new type of epilog script called "housekeeping" that runs
after the job.  Instead of freeing resources directly to the scheduler,
jobs free resources to housekeeping, post their free event, and may reach
INACTIVE state.  Meanwhile, housekeeping can run a script on the allocated
resources and return the resources to the scheduler when complete.  The
resources are still allocated to the job as far as the scheduler is
concerned while housekeeping runs.  However since the job has transitioned
to INACTIVE, the flux-accounting plugin will decrement the running job
count for the user and stop billing the user for the resources.
'flux resource list' utility shows the resources as allocated.

By default, resources are released to the scheduler only after all ranks
complete housekeeping, as before.  However, if configured, resources can
be freed to the scheduler immediately as they complete housekeeping on
each execution target, or a timer can be started on completion of the
first target, and when the timer expires, all the targets that have
completed thus far are freed in one go. Following that, resources are
freed to the scheduler immediately as they complete.

This works with sched-simple without changes, with the exception that the
hello protocol does not currently support partial release so, as noted in
the code, housekeeping and a new job could overlap when the scheduler is
reloaded on a live system.  Some RFC 27 work is needed to resolve this.

The Fluxion scheduler does not currently support partial release
(flux-framework/flux-sched#1151).  But as discussed over there, the
combination of receiving an R fragment and a jobid in the free request
should be sufficient to get that working.
@jameshcorbett
Copy link
Member

With the assumption that a fragment contains a subset of the job's broker ranks but the entire R for each of those ranks (i.e., the full per-rank R), adding this support should be straightforward.

Mainly what's needed is to identify the broker ranks in the R fragment and iterate through the vertices in the by_rank graph metadata map for each rank.

My understanding is that on elcap systems, the scheduler will need to be initialized from JGF in order to understand the rabbit layout. It will also need to emit JGF for jobs in order to facilitate scheduler restart. The partial release will come in the form of R, but that's OK because of this simplifying assumption, right?

milroy added a commit to milroy/flux-sched that referenced this issue Jul 6, 2024
Problem: Fluxion issue
flux-framework#1151 and flux-core
issue flux-framework/flux-core#4312
identified the need for partial release of resources. The current
functionality need is to release all resources managed by a single
broker rank. In the future support for releasing arbitrary subgraphs
will be needed for cloud and converged use cases.

Modify the rem_* traverser functions to take a modification type and
type_to_count unordered_map. Add logic in the recursive job
modification calls to distinguish between a full and partial job
cancellation and issue corresponding planner interface calls, handling
errors as needed.

Switch cancellation behavior based on the job_modify_t enum class.
@milroy
Copy link
Member

milroy commented Jul 10, 2024

The partial release will come in the form of R but that's OK because of this simplifying assumption right?

That's correct. The partial cancel/release just uses the Rlite fragment string contained in the free RPC payload.

adding this support should be straightforward.

Famous last words. Fortunately the PR is merged and the functionality is in Fluxion now.

@trws
Copy link
Member

trws commented Jul 31, 2024

@milroy, it looks like this one can be closed, so I'm closing it. If there's something we need to keep open here feel free to re-open.

@trws trws closed this as completed Jul 31, 2024