job-manager: add support for housekeeping scripts with partial release of resources #5818

garlick · 2024-03-21T20:57:09Z

This is a prototype of an idea @grondo and I were discussing last week after someone was complaining about a large job stuck in CLEANUP state while the epilog ran (reminding us with a pang of guilt that the lack of "partial release" support is going to be an ongoing source of sadness). This introduces a new type of epilog script called "housekeeping" that runs after the one we have now. There are two main features here

the job can reach INACTIVE state before housekeeping begins. This implies that the job no longer shows up in the active job list, and that flux-accounting stops accounting for the resources against the user.
if the scheduler supports it, resources can be freed to the scheduler in parts as housekeeping completes on each execution target

sched-simple supports (2) above. Fluxion does not, yet. See flux-framework/flux-sched#1151.

The idea was that maybe we could move some things that take a long time (like ansible updates) and that are not job-specific, from the epilog to housekeeping. In addition, working on this will remove some barriers to implementing partial release of resources from jobs, which is more involved, but the ultimate goal.

One loose end is the RFC 27 hello request. Currently an R fragment cannot be passed back that way. For now, if a job's allocation is partially freed when the scheduler is reloaded, it is logged at LOG_ERR and not included in the list of pre-allocated jobs. We accept that we might schedule a job on a node before housekeeping is finished when the scheduler is reloaded (hopefully rare). More design work required for RFC 27.

Another is tooling. flux resource list shows resources in housekeeping as allocated. It would be nice to know at a glance which nodes are tied up in housekeeping and maybe provide other information that would add transparency and improve how the admins interact with the machine.

~~This is currently based on #5796 because there was a bit of overlap in the code, but conceptually they are separate features. At the time this PR was posted, the new work is the top 4 commits.~~

I probably jumped the gun going this far with this without getting some feedback on the approach. I'll just say it's only a couple days work and I learned a fair bit about the problem space so we can throw this out and start over if it ends up being off base.

src/modules/job-manager/housekeeping.c

garlick · 2024-04-03T00:11:57Z

rebased on current master

garlick · 2024-04-03T13:49:44Z

Possibly this is going too far, but to reuse as much tooling as possible, it would be nice if housekeeping "jobs" appeared as a real jobs that you could view with flux jobs, that produce an eventlog for debugging, that could be recovered on restart, etc.

I wonder if we could revive some of the "fastrun" experiment e.g. 9f2a338 to allow the job manager to start instance owner jobs on its own. Then start one of these jobs for housekeeping each time a regular job completes. Bypass the exec system in favor of the direct, stripped down approach proposed here (no shell), and implement partial release using resource-update so flux jobs would list housekeeping jobs:

$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
   ƒ26aXQeNs flux     housekeep+ R       2      2     3.5h elcap[3423,3499]

and it would be obvious which nodes are stuck, when that happens. The eventlog could provide a nice record of how nodes behaved.

grondo · 2024-04-03T14:43:17Z

That's a cool idea! While there would be some interesting benefits, I did consider some issues:

Would we end up in the same place because jobs don't currently support partial release? I.e. we'd have to solve partial release and if we did that we could just use traditional epilogs.

Also, I wonder what would actually be in the eventlog for these jobs, since there is no scheduling and no job shell? Would it just be a start, finish, and clean? Without a job shell we wouldn't capture the housekeeping job's output in the traditional way.

The housekeeping jobs would have to be special cased to avoid running the job prolog and also spawning housekeeping work themselves.

Not to say these issues can't be overcome, but I thought I'd bring them up.

garlick · 2024-04-03T14:53:38Z

Hmm, great points to ponder.

I might be trivializing, but I thought partial release from the housekeeping exec to the scheduler could be implemented as it is here, except we'd support resource-update to show what's been released to the tools.

I guess that implies some sort of "exec bypass" in the job manager.

Perhaps in the spirit of prototyping I could try to tack something on here as a proof of concept if the idea isn't too outlandish.

grondo · 2024-04-03T14:56:08Z

As a simpler, though less interesting, alternative, we could add a new resource "state" like allocated or free. I'm not sure I like "houskeeping" but maybe call it maint or service mode for now, since housekeeping is kind of like scheduled maintenance.

The job manager could just keep the set of resources in this state and send that set along in the job-manager.resource-status response with the allocated resources, and flux resource list could report nodes in this state.

The one thing missing here that's really nice in using the job abstraction (or even the traditional approach of keeping jobs in CLEANUP state) is that there's no tracking of how long resources have been doing housekeeping/maintenance. Probably would need some way to provide that. 🤔

grondo · 2024-04-03T15:16:44Z

Perhaps in the spirit of prototyping I could try to tack something on here as a proof of concept if the idea isn't too outlandish.

It does not seem too outlandish for a prototype. My main worry is having two classes of jobs, and the fallout this could cause among all the tools. However, I do see why tracking these things as jobs is compelling.

grondo · 2024-04-03T15:23:18Z

I might be trivializing, but I thought partial release from the housekeeping exec to the scheduler could be implemented as it is here, except we'd support resource-update to show what's been released to the tools.

This might be good to do anyway since there's interest in allowing jobs to release resources while still running. We'll have to think through how tools and the job record reflect a changing resource set.

garlick · 2024-04-25T14:28:33Z

Mentioning @ryanday36.

Ryan, this is a new epilog like capability that runs after each job but does not prevent the job from completing. It also supports "partial release" whereby any stragglers are returned to the scheduler independently. The main topic of discussion is tooling. How do you tell that housekeeping is still running on node(s) and for how long?

Some other aspects are:

stdio is not captured and nodes are not drained when the script fails. I figured the script itself could arrange for either of those things. For example calling flux logger or flux resource drain) but that could be changed.
right now there is a corner case where a restart of Flux could lose track of running housekeeping work, and new jobs could start on a node before housekeeping is finished. Presumably we'll need to address that in some way.

ryanday36 · 2024-04-25T17:53:06Z

This seems like an interesting approach. As far as tooling, if you do list the housekeeping jobs are "real" jobs, I like that the they are listed as belonging to the flux user rather than the original job user. Slurm jobs in "completing" states that list the submitting user seem to be a perpetual source of confusion for our users. If you do list them as "real" jobs though, it would be good if they had some state other than "R" to avoid confusion.

I think that using flux logger and flux resource drain in the housekeeping jobs would be acceptable, although it would still be good if the node got drained if a housekeeping job exited with anything other than success.

I agree that yes, it would be good to track nodes that have unfinished housekeeping.

My last thought is a sort of meta-concern. It seems like whenever we talk about the current prolog / epilog it's described as something that you're not quite happy with. Would this be further cementing that implementation, or would it be separating concerns in a way that would allow you to make changes to the current prolog / epilog more easily?

grondo · 2024-04-25T20:03:39Z

I think that using flux logger and flux resource drain in the housekeeping jobs would be acceptable

My worry is that unanticipated errors from a script or set or scripts run in this way will be impossible to diagnose after the fact. You can easily code your housekeeping script to do

  try_something || flux logger "something failed"

Are we going to suggest authors do that on every line?

What if your python script throws an unexpected exception, are we going to suggest cleanup scripts are wrapped in a large try/expect block so that the exception can be logged with flux logger? What if the exception is that the script tried to connect to flux and somehow failed?

If we represent these things as jobs, we have a nice way to capture output from jobs, so maybe we could just do that? Is the worry that it is just too much work, or that it will double the KVS usage since there will be one housekeeping job per actual job?

garlick · 2024-04-25T20:14:31Z

Good points. I guess I was thinking of the runner script that they have now that could log errors from the scripts that it runs, but it's actually easy enough to capture output if we need it. The downside is if we capture it in the flux log like we do now, it can be a source of log noise that pushes out the interesting stuff. I think its very easy to add.

grondo · 2024-04-25T20:17:37Z

If these housekeeping scripts are emulated like jobs could we just log the errors in an output eventlog?
Or are they only jobs as far as job-list is concerned, so all other job related commands are just going to say "no such jobid"?

grondo · 2024-04-25T20:51:55Z

It seems like whenever we talk about the current prolog / epilog it's described as something that you're not quite happy with. Would this be further cementing that implementation, or would it be separating concerns in a way that would allow you to make changes to the current prolog / epilog more easily?

The problem with the current way the per-node epilog is managed is that it uses a facility that was designed for things that need to run once in a central location after each job to remotely execute the epilog using a specialized script. This was meant to be a stopgap until we could distribute the job execution system, which would then be able to manage the epilog scripts directly from each broker.

However, in my opinion, the solution proposed here is superior to even what we had planned. The benefits I see are:

The administrative epilog is separated from the user's job, both for runtime accounting and also reporting as you note
The current style epilog could still be used during the transition, or for than anything that should keep resources associated with the user's job until completion (i.e. both epilog styles can coeexist)
It doesn't require the job execution system to be distributed
Implementation of how the epilog or housekeeping scripts are run can be improved over time (not dependent on job execution system)
More flexible: you could presumably schedule maintenance or other work not related to jobs as 'housekeeping'

The only drawback I can see is perhaps more muddled reporting -- it isn't clear what user job spawned a housekeeping job (if that is the way it is going to be reported). But that could easily be queried, or could even be an annotation come to think of it. Maybe that is actually a benefit.

Also, reporting housekeeping work as jobs will (I assume?) report the nodes as "allocated" instead of some other state, and when queries are made for jobs there will always be some housekeeping jobs reported that would likely have to be filtered out.

ryanday36 · 2024-04-25T21:46:57Z

Cool. Thanks for the discussion on the benefits @grondo. That makes me feel better about it.

I do think that it would be good to be able to correlate housekeeping jobs with the user job that launched them. It's good to be able to find out if one user is consistently managing to leave nodes in bad states.

I'd actually be interested in being able to track the time spent in housekeeping separately from the time running the user jobs. It could be a good tool for keeping run_ansible and other epilog things from getting out of hand. That said, for most reporting, we'd probably lump them in as 'allocated'. I believe that's how time spent in 'completing' currently gets counted in Slurm.

Lastly, good points on the logging discussion. Things like run_ansible do generate a ton of output that would probably be good to keep out of the flux log. It would be good to be able to get to it if the housekeeping job fails though, so a job specific output event log that could be referenced if the job fails seems like it would be useful.

garlick · 2024-06-06T14:58:02Z

Rebased on current master with no other changes.

src/modules/job-manager/housekeeping.c

+
+    if (!(hk = calloc (1, sizeof (*hk))))
+        return NULL;
+    hk->ctx = ctx;


garlick · 2024-06-17T21:51:40Z

Changes in that last push were

convert to bulk_exec
add systemd unit for the housekeeping script

Per offline discussion, next steps are

add query RPC to allow a command line tool to list nodes in housekeeping, by job id
add cancel RPC to allow a set of ranks to be sent a signal
drain nodes on failure

garlick · 2024-06-29T14:48:10Z

Trying on my test cluster. I did get the following from the bash completions:

-bash: /etc/bash_completion.d/flux: line 2178: syntax error near unexpected token `)'
-bash: /etc/bash_completion.d/flux: line 2178: `    housekeeping)'
-bash: /etc/bash_completion.d/flux: line 2178: syntax error near unexpected token `)'
-bash: /etc/bash_completion.d/flux: line 2178: `    housekeeping)'

otherwise looks good so far!

grondo · 2024-06-29T23:10:23Z

Thanks for testing! I thought I'd fixed that one, sorry! I'll fix it up when I get back to a computer

grondo · 2024-06-30T01:06:21Z

Ok, fixed that bash completion error (it was a mistake in fixing up a conflict)

garlick · 2024-06-30T03:23:39Z

Thanks! I just pushed a couple of typos (mine!)

grondo · 2024-07-01T18:18:24Z

fixup commits LGTM! (though make dist is failing, I guess experimental.rst also need to go in EXTRA_DIST?)

garlick · 2024-07-01T18:23:22Z

Oops! Ok I'll fix that. OK to squash the fixups and rebase on master?

grondo

This LGTM! Tested heavily on my test cluster with no issues so far!

garlick · 2024-07-02T01:39:12Z

OK! 🎺 🎺

Problem: the bulk-exec API in job-exec is useful elsewhere, but it is local to the job-exec module. Move it into the libsubprocess directory where it can be more easily reused within flux-core (it is not added to the public API). Update job-exec.

Problem: free requests may not take place in the context of the job once housekeeping is in place, but t2212-job-manager-plugins.t uses the debug.free-request event as an indication that the job debug flag could be set. Use debug.alloc-request in the test instead.

Problem: jobs get stuck in CLEANUP state while long epilog scripts run, causing sadness and idling resources. Introduce a new type of epilog script called "housekeeping" that runs after the job. Instead of freeing resources directly to the scheduler, jobs free resources to housekeeping, post their free event, and may reach INACTIVE state. Meanwhile, housekeeping can run a script on the allocated resources and return the resources to the scheduler when complete. The resources are still allocated to the job as far as the scheduler is concerned while housekeeping runs. However since the job has transitioned to INACTIVE, the flux-accounting plugin will decrement the running job count for the user and stop billing the user for the resources. 'flux resource list' utility shows the resources as allocated. By default, resources are released to the scheduler only after all ranks complete housekeeping, as before. However, if configured, resources can be freed to the scheduler immediately as they complete housekeeping on each execution target, or a timer can be started on completion of the first target, and when the timer expires, all the targets that have completed thus far are freed in one go. Following that, resources are freed to the scheduler immediately as they complete. This works with sched-simple without changes, with the exception that the hello protocol does not currently support partial release so, as noted in the code, housekeeping and a new job could overlap when the scheduler is reloaded on a live system. Some RFC 27 work is needed to resolve ths. The Fluxion scheduler does not currently support partial release (flux-framework/flux-sched#1151). But as discussed over there, the combination of receiving an R fragment and a jobid in the free request should be sufficient to get that working.

Problem: housekeeping scripts should be run in a systemd cgroup for reliable termination, logging to the local systemd journal, and debugging using well known systemd tools. Add a systemd "oneshot" service unit for housekeeping, templated by jobid and corresponding script that can be configured as an imp run target. This is similar to what was proposed for prolog/epilog except if the script fails, the node is drained.

Problem: bulk_exec does not offer an interface to kill a subset of the running subprocesses. Add a ranks parameter to bulk_exec_kill() and bulk_exec_imp_kill(). Set it to NULL to target all subprocesses like before. Update test and users.

Problem: There is interface to query information about the housekeeping subsystem. Add a new flux-housekeeping(1) command, which can be used to query or terminate housekeeping tasks.

Problem: The flux-housekeeping(1) command doesn't have any bash tab completions. Add them.

Problem: there is no test coverage for job-manager housekeeping. Add a sharness script.

Problem: There is no manual for flux-housekeeping(1). Add a simple man page for flux-housekeeping(1).

Problem: The job-manager.housekeeping config table is not documented. Document it in flux-config-job-manager(5).

Problem: the text to explain expectations for experimental features must be repeated in documentation for each one which is extra work for both the author and the reader. Add common/experimental.rst which can be included from man pages that discuss experimental features. Like common/resource.rst, the idea is to make it a standalone section which can then be referenced from the feature documentation. This should improve consistency and make it easier to document these features and interfaces. Fixes flux-framework#6066

Problem: the EXPERIMENTAL section refers to the project issue tracker but man pages do not contain the URL. Add it.

codecov · 2024-09-30T02:30:19Z

Codecov Report

Attention: Patch coverage is 79.31727% with 103 lines in your changes missing coverage. Please review.

Project coverage is 83.34%. Comparing base (056c590) to head (d82c356).
Report is 530 commits behind head on master.

Files with missing lines	Patch %	Lines
src/modules/job-manager/housekeeping.c	78.43%	80 Missing ⚠️
src/cmd/flux-housekeeping.py	83.16%	17 Missing ⚠️
src/modules/job-manager/job-manager.c	55.55%	4 Missing ⚠️
src/modules/job-manager/alloc.c	75.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5818      +/-   ##
==========================================
- Coverage   83.34%   83.34%   -0.01%     
==========================================
  Files         519      521       +2     
  Lines       83819    84305     +486     
==========================================
+ Hits        69862    70261     +399     
- Misses      13957    14044      +87

Files with missing lines	Coverage Δ
src/common/libsubprocess/bulk-exec.c	`77.98% <100.00%> (ø)`
src/modules/job-exec/exec.c	`80.88% <100.00%> (ø)`
src/modules/job-manager/event.c	`78.30% <100.00%> (-0.05%)`	⬇️
src/modules/job-manager/alloc.c	`75.32% <75.00%> (-0.60%)`	⬇️
src/modules/job-manager/job-manager.c	`63.49% <55.55%> (-0.31%)`	⬇️
src/cmd/flux-housekeeping.py	`83.16% <83.16%> (ø)`
src/modules/job-manager/housekeeping.c	`78.43% <78.43%> (ø)`

... and 11 files with indirect coverage changes

github-advanced-security bot found potential problems Mar 21, 2024

View reviewed changes

src/modules/job-manager/housekeeping.c Fixed Show fixed Hide fixed

garlick force-pushed the housekeeping branch 3 times, most recently from 242e109 to 9713a67 Compare March 25, 2024 21:33

grondo mentioned this pull request Mar 29, 2024

need a way for job manager epilog to implement "partial release" #4312

Open

garlick mentioned this pull request Mar 30, 2024

flux-shutdown: need option to force fast shutdown #5843

Open

garlick force-pushed the housekeeping branch from 9713a67 to b61b655 Compare April 3, 2024 00:11

garlick force-pushed the housekeeping branch from b61b655 to 10330a9 Compare April 18, 2024 14:08

garlick force-pushed the housekeeping branch from 10330a9 to cc2033f Compare May 8, 2024 13:35

garlick force-pushed the housekeeping branch from cc2033f to 1890085 Compare June 6, 2024 14:57

garlick mentioned this pull request Jun 13, 2024

systemd: add prolog/epilog service units #6040

Merged

garlick force-pushed the housekeeping branch from 1890085 to eda69c9 Compare June 17, 2024 21:00

github-advanced-security bot found potential problems Jun 17, 2024

View reviewed changes

garlick force-pushed the housekeeping branch from eda69c9 to ca3117d Compare June 17, 2024 22:19

garlick mentioned this pull request Jun 29, 2024

add draft/experimental feature class #6066

Closed

garlick changed the title ~~WIP: job-manager: add support for housekeeping scripts with partial release of resources~~ ob-manager: add support for housekeeping scripts with partial release of resources Jun 29, 2024

garlick changed the title ~~ob-manager: add support for housekeeping scripts with partial release of resources~~ job-manager: add support for housekeeping scripts with partial release of resources Jun 29, 2024

grondo force-pushed the housekeeping branch from 2881960 to 088df88 Compare June 30, 2024 01:05

garlick force-pushed the housekeeping branch from b91a4b1 to 68dbf3a Compare July 1, 2024 19:09

grondo approved these changes Jul 2, 2024

View reviewed changes

garlick added the merge-when-passing label Jul 2, 2024

garlick and others added 12 commits July 2, 2024 01:40

libsubprocess: add ranks parameter to bulk kill

4ddfca2

Problem: bulk_exec does not offer an interface to kill a subset of the running subprocesses. Add a ranks parameter to bulk_exec_kill() and bulk_exec_imp_kill(). Set it to NULL to target all subprocesses like before. Update test and users.

cmd: add flux-housekeeping(1)

59857ab

Problem: There is interface to query information about the housekeeping subsystem. Add a new flux-housekeeping(1) command, which can be used to query or terminate housekeeping tasks.

etc: add flux-housekeeping(1) bash completions

875fde8

Problem: The flux-housekeeping(1) command doesn't have any bash tab completions. Add them.

testsuite: cover job-manager housekeeping

201cf87

Problem: there is no test coverage for job-manager housekeeping. Add a sharness script.

doc: add flux-housekeeping(1)

b30db05

Problem: There is no manual for flux-housekeeping(1). Add a simple man page for flux-housekeeping(1).

doc: add housekeeping documentation to flux-config-job-manager(5)

18c70b4

Problem: The job-manager.housekeeping config table is not documented. Document it in flux-config-job-manager(5).

man: add github issue tracker to resources

d82c356

Problem: the EXPERIMENTAL section refers to the project issue tracker but man pages do not contain the URL. Add it.

garlick force-pushed the housekeeping branch from 68dbf3a to d82c356 Compare July 2, 2024 01:40

mergify bot merged commit c3df073 into flux-framework:master Jul 2, 2024
32 checks passed

garlick deleted the housekeeping branch July 2, 2024 16:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

job-manager: add support for housekeeping scripts with partial release of resources #5818

job-manager: add support for housekeeping scripts with partial release of resources #5818

garlick commented Mar 21, 2024 •

edited

Loading

garlick commented Apr 3, 2024

garlick commented Apr 3, 2024

grondo commented Apr 3, 2024

garlick commented Apr 3, 2024

grondo commented Apr 3, 2024

grondo commented Apr 3, 2024

grondo commented Apr 3, 2024

garlick commented Apr 25, 2024

ryanday36 commented Apr 25, 2024

grondo commented Apr 25, 2024

garlick commented Apr 25, 2024 •

edited

Loading

grondo commented Apr 25, 2024

grondo commented Apr 25, 2024

ryanday36 commented Apr 25, 2024

garlick commented Jun 6, 2024

garlick commented Jun 17, 2024

garlick commented Jun 29, 2024

grondo commented Jun 29, 2024

grondo commented Jun 30, 2024

garlick commented Jun 30, 2024

grondo commented Jul 1, 2024

garlick commented Jul 1, 2024

grondo left a comment

garlick commented Jul 2, 2024

codecov bot commented Sep 30, 2024

job-manager: add support for housekeeping scripts with partial release of resources #5818

job-manager: add support for housekeeping scripts with partial release of resources #5818

Conversation

garlick commented Mar 21, 2024 • edited Loading

garlick commented Apr 3, 2024

garlick commented Apr 3, 2024

grondo commented Apr 3, 2024

garlick commented Apr 3, 2024

grondo commented Apr 3, 2024

grondo commented Apr 3, 2024

grondo commented Apr 3, 2024

garlick commented Apr 25, 2024

ryanday36 commented Apr 25, 2024

grondo commented Apr 25, 2024

garlick commented Apr 25, 2024 • edited Loading

grondo commented Apr 25, 2024

grondo commented Apr 25, 2024

ryanday36 commented Apr 25, 2024

garlick commented Jun 6, 2024

garlick commented Jun 17, 2024

garlick commented Jun 29, 2024

grondo commented Jun 29, 2024

grondo commented Jun 30, 2024

garlick commented Jun 30, 2024

grondo commented Jul 1, 2024

garlick commented Jul 1, 2024

grondo left a comment

Choose a reason for hiding this comment

garlick commented Jul 2, 2024

codecov bot commented Sep 30, 2024

Codecov Report

garlick commented Mar 21, 2024 •

edited

Loading

garlick commented Apr 25, 2024 •

edited

Loading