
limits: add max-nodes limit support #349

Open · 2 of 5 tasks · Tracked by #4434
cmoussa1 opened this issue May 10, 2023 · 15 comments
Labels: new feature

Comments

cmoussa1 (Member) commented May 10, 2023

This was originally mentioned in #121 along with a couple of other limits to be enforced. A max-nodes limit enforces a max number of nodes that a user/bank can have across all of their running jobs. max-nodes is already an attribute for a user/bank row in the flux-accounting database and has a default value of 2147483647.

There was some good discussion in #121 about the logistics of supporting a max-nodes limit that I'll paste here.

The number of nodes that will eventually be allocated to a pending job request is not known at submission time, so it may be difficult for flux-accounting's jobtap plugin to enforce this limit. We left the conversation having concluded that only the scheduler can reasonably apply this limit, since only it knows how many nodes the next job would be allocated and can put that job on hold if it would exceed the configured maximum. One idea was to enable some sort of communication between the plugin and the scheduler, where the plugin would send a user/bank's max-nodes limit to the scheduler so the limit can be tracked and enforced there.

Things might have changed since our last conversation, though, and maybe there are other solutions to potentially consider!

Tasks
  1. improvement (cmoussa1)
  2. new feature plugin (cmoussa1)
  3. new feature plugin
cmoussa1 added the new feature label May 10, 2023
garlick (Member) commented May 10, 2023

For further context: I ran into the same issue when implementing the limit-job-size jobtap plugin, which enforces system-wide and per-queue job size limits described in RFC 33 based on the requested node count. flux-config-policy(5) documents the resulting limitation:

Note

Limit checks take place before the scheduler sees the request, so it is possible to bypass a node limit by requesting only cores, or the core limit by requesting only nodes (exclusively) since this part of the system does not have detailed resource information. Generally node and core limits should be configured in tandem to be effective on resource sets with uniform cores per node. Flux does not yet have a solution for node/core limits on heterogeneous resources.

I'm not sure to what extent implementing a per-user limit with the same caveat is helpful here. Maybe it would be enough to check the box on the project plan for production rollout and the scheduler-based check could come later?
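
For reference, the "configured in tandem" guidance above might look roughly like the following TOML; the values are illustrative only and assume the [policy.limits.job-size.max] table described in flux-config-policy(5):

[policy.limits.job-size.max]
nnodes = 8     # per-job node limit
ncores = 512   # per-job core limit, chosen as 8 nodes x 64 cores/node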

cmoussa1 (Member Author) commented:

I should follow up on this issue with some of the discussion from yesterday's team meeting. We talked briefly about a possible first cut: look at core (and node?) counts from jobspec when a job is submitted and keep track of them there, holding jobs that would put a user/bank over their max_nodes limit until a currently running job exits and frees up resources.

This would require the plugin to know some details about the system it's running on, correct? i.e., it would have to know how many cores are on a node. If so, the plugin should be able to get this information from a config file, where it could be parsed and sent to the plugin, perhaps by the priority-update script since it already sends information to the plugin.

garlick (Member) commented May 11, 2023

I'm not sure everyone will agree, but for a first cut I would suggest just following the pattern of the simple limit checks implemented in limit-job-size, which have no knowledge of the resources or of whether node-exclusive scheduling is configured.

To take the next step we'll either want to share some information between subsystems that is not currently shared, or require the scheduler to do some of the checks. But for systems where all the nodes have the same number of cores it shouldn't be too bad to set limits so that max_cores = (max_nodes * cores_per_node).
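
As a purely hypothetical illustration: on a cluster where every node has 64 cores, a max_nodes limit of 8 would be approximated by setting max_cores = 8 * 64 = 512, so that a core-only request cannot sidestep the intended node cap.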

grondo (Contributor) commented May 11, 2023

The limit-job-size plugin has the advantage that it only needs to consider the current job.

For a maximum node count or maximum core count limit, the accounting plugin would need to consider all currently allocated jobs for the current user/bank along with the current job. This will be tricky if some allocated jobs only specified cores while others only specified nodes, since all currently allocated jobs will need to be accounted for when applying the limit.

Perhaps we could make it simple for a plugin to get the total allocated cores and nodes from the job.state.run callback so that the accounting plugin can keep an accurate count for both. The drawback would be that R would have to be fetched after it is written to the KVS because I don't think the job-manager has that information at this time. (and that might introduce a race condition between when one job enters the RUN state and its allocation count is added to the total, vs when the next job's limit is checked...)

garlick (Member) commented May 11, 2023

Great points, I wasn't thinking about that aspect of it, just about the requested resources.

Possibly we could just have the scheduler return R in the alloc response to avoid re-fetching it from the KVS? We may be able to make that change in libschedutil since the scheduler already passes R back to schedutil_alloc_respond_success_pack().

grondo (Contributor) commented May 11, 2023

Yeah, that would make things easier! I was actually half remembering we already thought of doing that, but I'm not sure we ever actually did it...

garlick (Member) commented May 11, 2023

Heh me too, although we followed a similar pattern when we added jobspec to the alloc request, so maybe we are thinking of that.

cmoussa1 self-assigned this May 17, 2023
cmoussa1 (Member Author) commented:

Do you think we should open a separate issue on this (i.e., one on having the scheduler return R in the alloc response)? I'm not sure whether we think this might need more discussion/thought first. :-)

cmoussa1 (Member Author) commented Jun 7, 2023

@grondo, @garlick this was the flux-accounting issue I was referring to in our meeting today. :-)

Also tagging @milroy here since we've mentioned that this will most likely involve/require flux-sched support to be able to enforce a max-nodes user limit.

grondo (Contributor) commented Jun 8, 2023

Thanks @cmoussa1. Re-reading above it seems like the current summary is:

  • as a first cut, implement holistic limits in flux-accounting instead of a max-nodes limit, i.e. impose a max-nodes + max-cores limit for users across all jobs
  • as a prerequisite, the accounting plugin will need access to the actual resource counts assigned to jobs, so that nnodes for core-only requests and ncores for node-only requests can be accounted for. Therefore jobtap: add allocated resource information in job.state.run callbacks flux-core#3851 should be solved first.

Edit: Forgot to mention that more design work is needed on how to loop the scheduler into these limits so that a real max-nodes limit could be imposed.

Also, there is probably a race condition we should consider if the flux-accounting jobtap plugin is enforcing these limits. E.g., a max-nodes limit could be exceeded or hit during one job's job.state.run callback while, at the same time, the scheduler is allocating more nodes to that user before the plugin has had a chance to hold all pending jobs for the user.

cmoussa1 removed their assignment Jan 16, 2024
cmoussa1 (Member Author) commented:

Now that flux-framework/flux-core#3851 has been closed, I plan on taking a look at this issue and seeing if I can get anywhere useful here - will update this thread as needed. :-)

cmoussa1 (Member Author) commented:

One question that I've realized while starting to play with this: if I am understanding flux-framework/flux-core#5472 correctly, R can be unpacked in the priority plugin in the callback for job.state.run, correct? And we can unpack R like the following:

json_t *R = NULL;
if (flux_plugin_arg_unpack (args,
                            FLUX_PLUGIN_ARG_IN,
                            "{s?o}",
                            "R", &R) < 0)
    return -1;  /* R missing or unpack failed; handle the error */

And grab attributes from it (this is the R from a single node job):

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "children": {
          "core": "0-1"
        }
      }
    ],
    "starttime": 1711648632.4934533,
    "expiration": 0.0,
    "nodelist": [
      "docker-desktop"
    ]
  }
}
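
For reference, a node count can be derived from an R object like the one above by summing the R_lite "rank" idsets. Here is a minimal sketch assuming flux-core's public idset API (flux/idset.h); count_nodes_in_R is a hypothetical helper name, not an existing function:

#include <flux/idset.h>
#include <jansson.h>

/* Hypothetical helper: return the number of broker ranks (nodes) in R,
 * or -1 on error. */
static int count_nodes_in_R (json_t *R)
{
    json_t *R_lite;
    json_t *entry;
    size_t index;
    int nnodes = 0;

    if (json_unpack (R, "{s:{s:o}}", "execution", "R_lite", &R_lite) < 0)
        return -1;
    json_array_foreach (R_lite, index, entry) {
        const char *ranks;
        struct idset *ids;
        if (json_unpack (entry, "{s:s}", "rank", &ranks) < 0
            || !(ids = idset_decode (ranks)))
            return -1;
        nnodes += (int) idset_count (ids);  /* one broker rank per node */
        idset_destroy (ids);
    }
    return nnodes;
}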

If, for example, a user/bank is already at their max nodes limit across all of their currently running jobs, would the plan here be to raise an exception on this job? I don't believe we can add a job dependency at this point in a job's life, but I could definitely be mistaken or could be misunderstanding what a solid approach would be here.

grondo (Contributor) commented Mar 28, 2024

we can unpack R like the following:

Yes, that looks right.

If, for example, a user/bank is already at their max nodes limit across all of their currently running jobs, would the plan here be to raise an exception on this job? I don't believe we can add a job dependency at this point in a job's life, but I could definitely be mistaken or could be misunderstanding what a solid approach would be here.

I think at job.state.run you would not be applying the limit; instead, in this callback you'd check the number of nodes/cores in R and then increment the current allocated node/core count for the user/bank.

Then in the job.state.depend callback you can estimate the number of nodes/cores that would be allocated by the job's jobspec (this is where the discussion about using cores and/or nodes comes in, since it will be difficult if not impossible to estimate nnodes exactly for a given jobspec). If that estimate plus the current core/node count would exceed the current max, then this job could get a max-resources or similar dependency, exactly like you do for max jobs.
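
To make that two-callback flow concrete, here is a minimal, illustrative jobtap plugin sketch. It is not flux-accounting code: the usage bookkeeping and jobspec estimation are left as hypothetical placeholders in comments, and the dependency name max-resources-user-limit is made up.

/* Illustrative sketch: count allocated nodes at job.state.run, gate new
 * jobs at job.state.depend. Helpers referenced in comments are hypothetical. */
#include <flux/core.h>
#include <flux/jobtap.h>
#include <jansson.h>

static int run_cb (flux_plugin_t *p, const char *topic,
                   flux_plugin_arg_t *args, void *arg)
{
    json_t *R = NULL;
    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s?o}", "R", &R) < 0 || !R)
        return -1;
    /* hypothetical: add the node/core counts found in R to this
     * user/bank's running totals, e.g. using count_nodes_in_R () above */
    return 0;
}

static int depend_cb (flux_plugin_t *p, const char *topic,
                      flux_plugin_arg_t *args, void *arg)
{
    flux_jobid_t id;
    json_t *jobspec = NULL;
    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s:I s?o}", "id", &id,
                                "jobspec", &jobspec) < 0)
        return -1;
    /* hypothetical: estimate this job's node/core request from jobspec;
     * if current usage + estimate exceeds the user/bank limit, hold the
     * job with a dependency, mirroring the existing max-jobs handling:
     *
     *   flux_jobtap_dependency_add (p, id, "max-resources-user-limit");
     */
    return 0;
}

static const struct flux_plugin_handler tab[] = {
    { "job.state.run", run_cb, NULL },
    { "job.state.depend", depend_cb, NULL },
    { NULL, NULL, NULL },
};

int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_register (p, "max-nodes-sketch", tab);
}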

ryanday36 (Contributor) commented:

Hi @cmoussa1, have you been able to make any progress on this issue? Folks are looking to limit running jobs per user on Tuolumne as it gets closer to being generally available.

cmoussa1 (Member Author) commented Jan 6, 2025

There is an open PR to get things kicked off here: #469 is currently under review and looks to initialize the plugin with a cores-per-node estimation, which would allow the plugin to at least enforce some per-user limits using total cores instead of nodes. Once that lands, I think I can start adding tracking in the plugin to look at a total count across a user's jobs.
