limits: add max-nodes limit support #349
For further context: I ran into the same issue when implementing the
I'm not sure to what extent implementing a per-user limit with the same caveat is helpful here. Maybe it would be enough to check the box on the project plan for production rollout, and the scheduler-based check could come later?
I should follow up on this issue with some of the discussion we had during yesterday's team meeting. We talked briefly about how a first cut at this issue might be to look at core (also node?) counts from jobspec when a job is submitted and keep track of them there, holding jobs that would put a user/bank over their `max-nodes` limit.

This would require the plugin to know some details about the system it's running on, correct? I.e., it would have to know how many cores are on a node? If so, the plugin should be able to unpack this information either from a config file, where it can be parsed and sent to the plugin, maybe in the
I'm not sure if everyone will agree, but I would suggest for a first cut just following the pattern of the simple limit checks implemented in `limit-job-size`, which have no knowledge of the resources or of whether node-exclusive scheduling is configured. To take the next step we'll either want to share some information between subsystems that is not currently shared, or require the scheduler to do some of the checks. But for systems where all the nodes have the same number of cores, it shouldn't be too bad to set limits so that
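For illustration, here is a minimal sketch of what such a first-cut, per-job check might look like in a jobtap plugin. This is an assumption-laden sketch, not the actual `limit-job-size` implementation: the `max_cores_per_job` value, the `count_cores()` helper, and the simplified jobspec handling are all invented for illustration.

```c
/* Sketch: a simple per-job limit check in the style of limit-job-size.
 * All names and the simplified jobspec handling here are illustrative
 * assumptions, not the real plugin. */
#include <string.h>
#include <jansson.h>
#include <flux/core.h>
#include <flux/jobtap.h>

static int max_cores_per_job = 64;  /* assumed; e.g. loaded from config */

/* Recursively total requested cores from the jobspec "resources" list,
 * multiplying counts down the tree (ignores count ranges and other
 * jobspec features for simplicity). */
static int count_cores (json_t *resources, int multiplier)
{
    size_t i;
    json_t *entry;
    int total = 0;

    json_array_foreach (resources, i, entry) {
        const char *type;
        int count = 1;
        json_t *with = NULL;

        if (json_unpack (entry, "{s:s s?i s?o}",
                         "type", &type,
                         "count", &count,
                         "with", &with) < 0)
            continue;
        if (strcmp (type, "core") == 0)
            total += multiplier * count;
        else if (with)
            total += count_cores (with, multiplier * count);
    }
    return total;
}

static int validate_cb (flux_plugin_t *p,
                        const char *topic,
                        flux_plugin_arg_t *args,
                        void *arg)
{
    json_t *resources;
    int ncores;

    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s:{s:o}}",
                                "jobspec", "resources", &resources) < 0)
        return -1;
    ncores = count_cores (resources, 1);
    if (ncores > max_cores_per_job)
        return flux_jobtap_reject_job (p, args,
                                       "requested cores (%d) exceed "
                                       "per-job limit (%d)",
                                       ncores, max_cores_per_job);
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_add_handler (p, "job.validate", validate_cb, NULL);
}
```

The key property, per the suggestion above, is that the check needs nothing but the jobspec itself, with no knowledge of the system's resources.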
The `limit-job-size` plugin has the advantage that it only needs to consider the current job. For a maximum node count or maximum core count limit, the accounting plugin would need to consider all currently allocated jobs for the current user/bank along with the current job. This will be tricky if some allocated jobs only specified cores while others only specified nodes, since all currently allocated jobs will need to be accounted for when applying the limit. Perhaps we could make it simple for a plugin to get the total allocated cores and nodes from the `job.state.run` callback so that the accounting plugin can keep an accurate count for both. The drawback would be that R would have to be fetched after it is written to the KVS, because I don't think the job-manager has that information at this time. (And that might introduce a race condition between when one job enters the RUN state and its allocation count is added to the total, vs. when the next job's limit is checked...)
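A rough sketch of the bookkeeping this would imply, assuming (hypothetically) that the plugin can learn each job's allocated node/core counts at `job.state.run` time; the `job_counts()` stub below stands in for exactly the "get totals from R" piece under discussion:

```c
/* Sketch: keep per-user totals of allocated nodes/cores, incremented
 * when a job enters RUN and decremented when it goes inactive.  How the
 * plugin learns each job's counts (job_counts below) is the open
 * question in this thread; here it is just a stub. */
#include <stdio.h>
#include <string.h>
#include <jansson.h>
#include <flux/core.h>
#include <flux/jobtap.h>

static json_t *totals;  /* "userid" -> {"nnodes": i, "ncores": i} */

/* Hypothetical: derive this job's allocated node/core counts, e.g.
 * from R once it is made available to the plugin. */
static void job_counts (flux_plugin_arg_t *args, int *nnodes, int *ncores)
{
    *nnodes = 0;
    *ncores = 0;
}

static void bump (int userid, int dnodes, int dcores)
{
    char key[32];
    int n = 0, c = 0;
    json_t *e;

    snprintf (key, sizeof (key), "%d", userid);
    if ((e = json_object_get (totals, key)))
        json_unpack (e, "{s:i s:i}", "nnodes", &n, "ncores", &c);
    json_object_set_new (totals, key,
                         json_pack ("{s:i s:i}",
                                    "nnodes", n + dnodes,
                                    "ncores", c + dcores));
}

static int state_cb (flux_plugin_t *p, const char *topic,
                     flux_plugin_arg_t *args, void *arg)
{
    int userid, nnodes, ncores;

    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s:i}", "userid", &userid) < 0)
        return -1;
    job_counts (args, &nnodes, &ncores);
    if (!strcmp (topic, "job.state.run"))
        bump (userid, nnodes, ncores);      /* job started */
    else
        bump (userid, -nnodes, -ncores);    /* job went inactive */
    return 0;
}

int flux_plugin_init (flux_plugin_t *p)
{
    if (!(totals = json_object ()))
        return -1;
    if (flux_plugin_add_handler (p, "job.state.run", state_cb, NULL) < 0
        || flux_plugin_add_handler (p, "job.state.inactive",
                                    state_cb, NULL) < 0)
        return -1;
    return 0;
}
```

A real implementation would also need to remember each job's counts (e.g., stash them on the job with `flux_jobtap_job_aux_set()`) so the decrement matches the increment, and deal with the race noted above.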
Great points, I wasn't thinking about that aspect of it, just about the requested resources. Possibly we could just have the scheduler return R in the alloc response to avoid re-fetching it from the KVS? We may be able to make that change in libschedutil since the scheduler already passes R back to

Yeah, that would make things easier! I was actually half remembering that we already thought of doing that, but I'm not sure we ever actually did it...

Heh, me too, although we followed a similar pattern when we added jobspec to the alloc request, so maybe we are thinking of that.

Do you think we should open a separate issue on this (i.e., one on having the scheduler return R in the alloc response)? Not sure if we're thinking this might need more discussion/thought. :-)
Thanks @cmoussa1. Re-reading above it seems like the current summary is:
Edit: Forgot to mention that more design work is needed on how to loop the scheduler into these limits so that a real max-nodes limit could be imposed. Also, there is probably a race condition we should consider if the flux-accounting jobtap plugin is enforcing these limits. E.g., a max-nodes limit could be exceeded or hit during one job's
Now that flux-framework/flux-core#3851 has been closed, I plan on taking a look at this issue and seeing if I can get anywhere useful here - will update this thread as needed. :-)
One question that I've just realized while starting to play with this: if I am understanding flux-framework/flux-core#5472, R can be unpacked in the priority plugin's callback:

```c
if (flux_plugin_arg_unpack (args,
                            FLUX_PLUGIN_ARG_IN,
                            "{s?o}",
                            "R", &R) < 0)
```

and we can grab attributes from it (this is the R from a single-node job):

```json
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "children": {
          "core": "0-1"
        }
      }
    ],
    "starttime": 1711648632.4934533,
    "expiration": 0.0,
    "nodelist": [
      "docker-desktop"
    ]
  }
}
```

If, for example, a user/bank is already at their max-nodes limit across all of their currently running jobs, would the plan here be to raise an exception on this job? I don't believe we can add a job dependency at this point in a job's life, but I could definitely be mistaken or could be misunderstanding what a solid approach would be here.
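For what it's worth, jobtap plugins can raise an exception on the current job with `flux_jobtap_raise_exception()`. A hedged sketch of that option; the `over_max_nodes()` check and the exception type string are invented for illustration:

```c
#include <stdbool.h>
#include <flux/core.h>
#include <flux/jobtap.h>

/* Hypothetical: compare the user's tracked allocated-node total (plus
 * this job) against their max-nodes limit. */
static bool over_max_nodes (int userid)
{
    return false;  /* stub */
}

static int run_cb (flux_plugin_t *p, const char *topic,
                   flux_plugin_arg_t *args, void *arg)
{
    int userid;

    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s:i}", "userid", &userid) < 0)
        return -1;
    if (over_max_nodes (userid)
        && flux_jobtap_raise_exception (p, FLUX_JOBTAP_CURRENT_JOB,
                                        "mf_priority",  /* assumed type */
                                        0,  /* severity 0 = fatal */
                                        "user/bank has reached its "
                                        "max-nodes limit") < 0)
        return -1;
    return 0;
}
```

As for dependencies, my understanding is that `flux_jobtap_dependency_add()` may only be used from the `job.state.depend` callback, which matches the intuition above that a dependency can't be added this late in a job's life.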
Yes, that looks right.
I think at Then in the
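Given an R document like the example above, here is a sketch of deriving allocated node/core counts with jansson. Simplifying assumptions are noted in the comments; real code would parse rank and core idsets with flux-core's idset API (`idset_decode()`/`idset_count()`) rather than the naive parsing here:

```c
#include <stdio.h>
#include <jansson.h>

/* Count nodes and cores from a version-1 R object.  Assumes each R_lite
 * entry covers a single rank and core sets look like "N" or "N-M"; real
 * R may use full idsets (e.g. "0-3,7"), which flux-core's idset API
 * handles properly. */
static int count_r (json_t *R, int *nnodes, int *ncores)
{
    json_t *r_lite, *entry;
    size_t i;

    *nnodes = 0;
    *ncores = 0;
    if (json_unpack (R, "{s:{s:o}}",
                     "execution", "R_lite", &r_lite) < 0)
        return -1;
    json_array_foreach (r_lite, i, entry) {
        const char *cores;
        int lo, hi;

        if (json_unpack (entry, "{s:{s:s}}",
                         "children", "core", &cores) < 0)
            return -1;
        (*nnodes)++;
        if (sscanf (cores, "%d-%d", &lo, &hi) == 2)
            *ncores += hi - lo + 1;   /* range like "0-1" */
        else
            *ncores += 1;             /* single core id */
    }
    return 0;
}
```

For the single-node R shown above, this would yield `nnodes = 1` and `ncores = 2`.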
Hi @cmoussa1, have you been able to make any progress on this issue? Folks are looking to limit running jobs per user on Tuolumne as it gets closer to being generally available.
There is an open PR to try and get things kicked off to support this: #469 is currently under review and looks to initialize the plugin with a cores-per-node estimation, which would allow it to enforce at least some per-user limits using total cores instead of nodes. Once that lands, I think I can start adding some tracking in the plugin to look at a total count across a user's jobs.
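If I follow, the idea is that an estimated cores-per-node lets the plugin translate between node and core units until real node counts are available. A trivial sketch of that arithmetic (names are illustrative):

```c
/* Illustrative only: convert a max-nodes limit into an equivalent core
 * budget, and estimate a job's node usage from requested cores by
 * rounding up.  Accurate only when all nodes have the same core count. */
static int max_cores_equivalent (int max_nodes, int cores_per_node)
{
    return max_nodes * cores_per_node;
}

static int estimated_nnodes (int ncores, int cores_per_node)
{
    return (ncores + cores_per_node - 1) / cores_per_node;  /* ceil */
}
```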
This was originally mentioned in #121 along with a couple of other limits to be enforced. A `max-nodes` limit enforces a maximum number of nodes that a user/bank can have across all of their running jobs. `max-nodes` is already an attribute for a user/bank row in the flux-accounting database and has a default value of `2147483647` (i.e., `INT_MAX`, effectively no limit).

There was some good discussion in #121 about the logistics of supporting a `max-nodes` limit that I'll paste here. The number of nodes that will eventually be allocated to a pending job request is not known at the time of submission, so it might be difficult for flux-accounting's jobtap plugin to enforce this limit. We left the conversation off concluding that only the scheduler can reasonably apply this limit, since only it knows how many nodes would be allocated to the next job and can put that job on hold if it would exceed the configured maximum, perhaps with some sort of communication between the plugin and the scheduler where the plugin could send data about a user/bank's `max-nodes` limit to the scheduler, which would track and enforce it there.

Things might have changed since our last conversation, though, and maybe there are other solutions to potentially consider!
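To make the plugin-to-scheduler idea concrete, here is a purely hypothetical sketch of the plugin pushing a user/bank's `max-nodes` limit to the scheduler over an RPC. The `sched.limits.update` topic and the payload shape are invented for illustration; no such interface exists today:

```c
#include <flux/core.h>

/* Hypothetical: inform the scheduler of a user/bank max-nodes limit so
 * it can hold jobs that would exceed it.  "sched.limits.update" is an
 * invented topic; no such RPC exists in flux today. */
static int send_limit (flux_t *h, int userid, const char *bank,
                       int max_nodes)
{
    flux_future_t *f;

    if (!(f = flux_rpc_pack (h, "sched.limits.update", FLUX_NODEID_ANY, 0,
                             "{s:i s:s s:i}",
                             "userid", userid,
                             "bank", bank,
                             "max_nodes", max_nodes)))
        return -1;
    if (flux_rpc_get (f, NULL) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    flux_future_destroy (f);
    return 0;
}
```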