
limits: add max-nodes limit support #349

Open · 2 of 5 tasks · Tracked by #4434
cmoussa1 opened this issue May 10, 2023 · 15 comments
Labels: new feature

Comments

cmoussa1 (Member) commented May 10, 2023

This was originally mentioned in #121 along with a couple of other limits to be enforced. A max-nodes limit enforces a max number of nodes that a user/bank can have across all of their running jobs. max-nodes is already an attribute for a user/bank row in the flux-accounting database and has a default value of 2147483647.

There was some good discussion in #121 about the logistics of supporting a max-nodes limit that I'll paste here.

The number of nodes that will eventually be allocated to a pending job request is not known at submission time, so it may be difficult for flux-accounting's jobtap plugin to enforce this limit. We left the conversation having concluded that only the scheduler can reasonably apply this limit, since only it knows how many nodes the next job would be allocated and can put that job on hold if it would exceed the configured maximum. One idea was to enable some sort of communication between the plugin and the scheduler, where the plugin would send a user/bank's max-nodes limit to the scheduler so the limit can be tracked and enforced there.

Things might have changed since our last conversation, though, and maybe there are other solutions to potentially consider!

Tasks
  1. improvement (cmoussa1)
  2. new feature plugin (cmoussa1)
  3. new feature plugin
cmoussa1 added the new feature label May 10, 2023
garlick (Member) commented May 10, 2023

For further context: I ran into the same issue when implementing the limit-job-size jobtap plugin, which enforces system-wide and per-queue job size limits described in RFC 33 based on the requested node count. flux-config-policy(5) documents the resulting limitation:

Note

Limit checks take place before the scheduler sees the request, so it is possible to bypass a node limit by requesting only cores, or the core limit by requesting only nodes (exclusively) since this part of the system does not have detailed resource information. Generally node and core limits should be configured in tandem to be effective on resource sets with uniform cores per node. Flux does not yet have a solution for node/core limits on heterogeneous resources.

I'm not sure to what extent implementing a per-user limit with the same caveat is helpful here. Maybe it would be enough to check the box on the project plan for production rollout and the scheduler-based check could come later?
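
For reference, the "configured in tandem" guidance above might look roughly like the following TOML; the values are illustrative only and assume the [policy.limits.job-size.max] table described in flux-config-policy(5):

[policy.limits.job-size.max]
nnodes = 8     # per-job node limit
ncores = 512   # per-job core limit, chosen as 8 nodes x 64 cores/node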

cmoussa1 (Member Author) commented:

I should follow up on this issue with some of the discussion from yesterday's team meeting. We talked briefly about a possible first cut: look at core (and node?) counts from jobspec when a job is submitted and keep track of them there, holding jobs that would put a user/bank over their max_nodes limit until a currently running job exits and frees up resources.

This would require the plugin to know some details about the system it's running on, correct? i.e., it would have to know how many cores are on a node. If so, the plugin should be able to get this information from a config file, where it could be parsed and sent to the plugin, perhaps by the priority-update script since it already sends information to the plugin.

garlick (Member) commented May 11, 2023

I'm not sure everyone will agree, but for a first cut I would suggest just following the pattern of the simple limit checks implemented in limit-job-size, which have no knowledge of the resources or of whether node-exclusive scheduling is configured.

To take the next step we'll either want to share some information between subsystems that is not currently shared, or require the scheduler to do some of the checks. But for systems where all the nodes have the same number of cores it shouldn't be too bad to set limits so that max_cores = (max_nodes * cores_per_node).
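
As a purely hypothetical illustration: on a cluster where every node has 64 cores, a max_nodes limit of 8 would be approximated by setting max_cores = 8 * 64 = 512, so that a core-only request cannot sidestep the intended node cap.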

grondo (Contributor) commented May 11, 2023

The limit-job-size plugin has the advantage that it only needs to consider the current job.

For a maximum node count or maximum core count limit, the accounting plugin would need to consider all currently allocated jobs for the current user/bank along with the current job. This will be tricky if some allocated jobs only specified cores while others only specified nodes, since all currently allocated jobs will need to be accounted for when applying the limit.

Perhaps we could make it simple for a plugin to get the total allocated cores and nodes from the job.state.run callback so that the accounting plugin can keep an accurate count for both. The drawback would be that R would have to be fetched after it is written to the KVS because I don't think the job-manager has that information at this time. (and that might introduce a race condition between when one job enters the RUN state and its allocation count is added to the total, vs when the next job's limit is checked...)

garlick (Member) commented May 11, 2023

Great points, I wasn't thinking about that aspect of it, just about the requested resources.

Possibly we could just have the scheduler return R in the alloc response to avoid re-fetching it from the KVS? We may be able to make that change in libschedutil since the scheduler already passes R back to schedutil_alloc_respond_success_pack().

grondo (Contributor) commented May 11, 2023

Yeah, that would make things easier! I was actually half remembering we already thought of doing that, but I'm not sure we ever actually did it...

garlick (Member) commented May 11, 2023

Heh me too, although we followed a similar pattern when we added jobspec to the alloc request, so maybe we are thinking of that.

cmoussa1 self-assigned this May 17, 2023
cmoussa1 (Member Author) commented:

Do you think we should open a separate issue on this (i.e., one on having the scheduler return R in the alloc response)? I'm not sure whether we think this might need more discussion/thought first. :-)

cmoussa1 (Member Author) commented Jun 7, 2023

@grondo, @garlick this was the flux-accounting issue I was referring to in our meeting today. :-)

Also tagging @milroy here since we've mentioned that this will most likely involve/require flux-sched support to be able to enforce a max-nodes user limit.

grondo (Contributor) commented Jun 8, 2023

Thanks @cmoussa1. Re-reading above it seems like the current summary is:

  • as a first cut, implement holistic limits in flux-accounting instead of a max-nodes limit, i.e. impose a max-nodes + max-cores limit for users across all jobs
  • as a prerequisite, the accounting plugin will need access to the actual resource counts assigned to jobs, so that nnodes for core-only requests and ncores for node-only requests can be accounted for. Therefore jobtap: add allocated resource information in job.state.run callbacks flux-core#3851 should be solved first.

Edit: Forgot to mention that more design work is needed on how to loop the scheduler into these limits so that a real max-nodes limit could be imposed.

Also, there is probably a race condition we should consider if the flux-accounting jobtap plugin is enforcing these limits. E.g., a max-nodes limit could be exceeded or hit during one job's job.state.run callback while, at the same time, the scheduler is allocating more nodes to that user before the plugin has had a chance to hold all pending jobs for the user.

cmoussa1 removed their assignment Jan 16, 2024
cmoussa1 (Member Author) commented:

Now that flux-framework/flux-core#3851 has been closed, I plan on taking a look at this issue and seeing if I can get anywhere useful here - will update this thread as needed. :-)

cmoussa1 (Member Author) commented:

One question that I've realized while starting to play with this: if I am understanding flux-framework/flux-core#5472 correctly, R can be unpacked in the priority plugin in the callback for job.state.run, correct? And we can unpack R like the following:

json_t *R = NULL;
if (flux_plugin_arg_unpack (args,
                            FLUX_PLUGIN_ARG_IN,
                            "{s?o}",
                            "R", &R) < 0)
    return -1;  /* R missing or unpack failed; handle the error */

And grab attributes from it (this is the R from a single node job):

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "children": {
          "core": "0-1"
        }
      }
    ],
    "starttime": 1711648632.4934533,
    "expiration": 0.0,
    "nodelist": [
      "docker-desktop"
    ]
  }
}
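
For reference, a node count can be derived from an R object like the one above by summing the R_lite "rank" idsets. Here is a minimal sketch assuming flux-core's public idset API (flux/idset.h); count_nodes_in_R is a hypothetical helper name, not an existing function:

#include <flux/idset.h>
#include <jansson.h>

/* Hypothetical helper: return the number of broker ranks (nodes) in R,
 * or -1 on error. */
static int count_nodes_in_R (json_t *R)
{
    json_t *R_lite;
    json_t *entry;
    size_t index;
    int nnodes = 0;

    if (json_unpack (R, "{s:{s:o}}", "execution", "R_lite", &R_lite) < 0)
        return -1;
    json_array_foreach (R_lite, index, entry) {
        const char *ranks;
        struct idset *ids;
        if (json_unpack (entry, "{s:s}", "rank", &ranks) < 0
            || !(ids = idset_decode (ranks)))
            return -1;
        nnodes += (int) idset_count (ids);  /* one broker rank per node */
        idset_destroy (ids);
    }
    return nnodes;
}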

If, for example, a user/bank is already at their max nodes limit across all of their currently running jobs, would the plan here be to raise an exception on this job? I don't believe we can add a job dependency at this point in a job's life, but I could definitely be mistaken or could be misunderstanding what a solid approach would be here.

grondo (Contributor) commented Mar 28, 2024

we can unpack R like the following:

Yes, that looks right.

If, for example, a user/bank is already at their max nodes limit across all of their currently running jobs, would the plan here be to raise an exception on this job? I don't believe we can add a job dependency at this point in a job's life, but I could definitely be mistaken or could be misunderstanding what a solid approach would be here.

I think at job.state.run you would not be applying the limit; instead, in this callback you'd check the number of nodes/cores in R and then increment the current allocated node/core count for the user/bank.

Then in the job.state.depend callback you can estimate the number of nodes/cores that would be allocated by the job's jobspec (this is where the discussion about using cores and/or nodes comes in, since it will be difficult if not impossible to estimate nnodes exactly for a given jobspec). If that estimate plus the current core/node count would exceed the current max, then this job could get a max-resources or similar dependency, exactly like you do for max jobs.
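
To make that two-callback flow concrete, here is a minimal, illustrative jobtap plugin sketch. It is not flux-accounting code: the usage bookkeeping and jobspec estimation are left as hypothetical placeholders in comments, and the dependency name max-resources-user-limit is made up.

/* Illustrative sketch: count allocated nodes at job.state.run, gate new
 * jobs at job.state.depend. Helpers referenced in comments are hypothetical. */
#include <flux/core.h>
#include <flux/jobtap.h>
#include <jansson.h>

static int run_cb (flux_plugin_t *p, const char *topic,
                   flux_plugin_arg_t *args, void *arg)
{
    json_t *R = NULL;
    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s?o}", "R", &R) < 0 || !R)
        return -1;
    /* hypothetical: add the node/core counts found in R to this
     * user/bank's running totals, e.g. using count_nodes_in_R () above */
    return 0;
}

static int depend_cb (flux_plugin_t *p, const char *topic,
                      flux_plugin_arg_t *args, void *arg)
{
    flux_jobid_t id;
    json_t *jobspec = NULL;
    if (flux_plugin_arg_unpack (args, FLUX_PLUGIN_ARG_IN,
                                "{s:I s?o}", "id", &id,
                                "jobspec", &jobspec) < 0)
        return -1;
    /* hypothetical: estimate this job's node/core request from jobspec;
     * if current usage + estimate exceeds the user/bank limit, hold the
     * job with a dependency, mirroring the existing max-jobs handling:
     *
     *   flux_jobtap_dependency_add (p, id, "max-resources-user-limit");
     */
    return 0;
}

static const struct flux_plugin_handler tab[] = {
    { "job.state.run", run_cb, NULL },
    { "job.state.depend", depend_cb, NULL },
    { NULL, NULL, NULL },
};

int flux_plugin_init (flux_plugin_t *p)
{
    return flux_plugin_register (p, "max-nodes-sketch", tab);
}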

ryanday36 (Contributor) commented:

Hi @cmoussa1, have you been able to make any progress on this issue? Folks are looking to limit running jobs per user on Tuolumne as it gets closer to being generally available.

cmoussa1 (Member Author) commented Jan 6, 2025

There is an open PR to get things kicked off here: #469 is currently under review and looks to initialize the plugin with a cores-per-node estimation, which would allow the plugin to at least enforce some per-user limits using total cores instead of nodes. Once that lands, I think I can start adding tracking in the plugin to look at a total count across a user's jobs.
