Enforcing user policy limits #638
@SteVwonder: I believe this is a topic we need to discuss at a coffee hour. Maybe we should do this at 2PM today.
Summarizing the coffee time discussion relevant to this issue:
In terms of how to handle static, per-queue limits, that is a separate issue (#642).
I'd like to summarize our design space for adding limit support in a table. Folks, please help grow/refine the table. As we discussed, there appear to be two different categories of limits, so I used that as the first dimension in our taxonomy and classified them into static vs. dynamic. Then, within each category, there seem to be two major kinds, so I added multi-queue aware vs. multi-queue agnostic as the second column. For each kind, we have the actual limits. I only captured two static limits (max job size and walltime) and two dynamic limits (max running jobs per user and max aggregate resources per user). If there are other limits we should consider, please grow the table. I also added a mechanism column to capture how each limit should be implemented based on the discussions so far, and a Day 1 column to prioritize what needs to be (or can be) done for our next milestone. Finally, I added the proposed semantics for when a job hits the corresponding limit.
Edit 1: Use "skip" for the handling semantics of dynamic limits.
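For reference, a rough reconstruction of the table described above (the Mechanism and Day 1 entries are left as placeholders since they are not reproduced here; the handling semantics follow the "reject for static, skip for dynamic" convention discussed below):

| Category | Queue scope | Limit | Mechanism | Day 1 | Semantics when hit |
| --- | --- | --- | --- | --- | --- |
| Static | multi-queue aware or agnostic | Max job size | TBD | TBD | reject |
| Static | multi-queue aware or agnostic | Max walltime | TBD | TBD | reject |
| Dynamic | multi-queue aware or agnostic | Max running jobs per user | TBD | TBD | skip |
| Dynamic | multi-queue aware or agnostic | Max aggregate resources per user | TBD | TBD | skip |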
@cmoussa1 or @SteVwonder: how does SLURM handle a job that exceeds the dynamic limits? If it rejects the job, then we can model ours after it and not have to worry about that weird interplay between the queuing policy and limits. But I added reject (or skip) for now. Since we are talking about a "limit", it may just make sense to reject the job. Just as the OS nproc limit rejects a new process from a user once the user exceeds that limit, we may as well reject the job for simplicity and clarity in our limit handling semantics. Just a thought.
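To illustrate the nproc analogy (this is plain POSIX/Python behavior, not Flux code): once a user is at their process limit, the kernel simply refuses to create another process rather than queuing the request.

```python
import errno
import os
import resource

# Inspect the current per-user process limit (nproc).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft} hard={hard}")

try:
    pid = os.fork()
    if pid == 0:
        os._exit(0)       # child exits immediately
    os.waitpid(pid, 0)
    print("fork succeeded: user is under the limit")
except OSError as e:
    if e.errno == errno.EAGAIN:
        # At the limit: the new process is rejected outright, not queued.
        print(f"fork rejected: {e}")
    else:
        raise
```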
Thanks Dong for the table, I think that helps summarize the situation. Do we want to support multi-queue aware max aggregate resources per user on day 1? After our recent coffee time discussion, I thought that would require too much complexity and runtime cost for us to be comfortable handling on day 1. IIUC, this would require the qmanager to traverse the entire resource section of jobspec to calculate the aggregate resource requirements or for us to push the logic down into the
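For a sense of the traversal cost being discussed, here is a rough sketch of what computing a job's aggregate resource request could look like. It assumes the canonical RFC 14 jobspec layout, where resources is a nested list of vertices with type, count, and an optional with list of children; it is not existing qmanager code.

```python
def aggregate_resources(resources, multiplier=1, totals=None):
    """Walk a jobspec 'resources' list and total the requested counts per type."""
    totals = {} if totals is None else totals
    for vertex in resources:
        count = multiplier * vertex.get("count", 1)
        totals[vertex["type"]] = totals.get(vertex["type"], 0) + count
        # Children are requested per parent, so counts multiply down the tree.
        aggregate_resources(vertex.get("with", []), count, totals)
    return totals

# Example: 2 nodes, each with 4 slots of 4 cores
jobspec_resources = [
    {"type": "node", "count": 2, "with": [
        {"type": "slot", "count": 4, "label": "default", "with": [
            {"type": "core", "count": 4},
        ]},
    ]},
]
print(aggregate_resources(jobspec_resources))  # {'node': 2, 'slot': 8, 'core': 32}
```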
From the Slurm docs:

Personally, I view the dynamic limits as more of a "throttling" or a "soft limit", since the system state can change to allow the jobs to run successfully in the future, and the static ones as a "hard limit" (i.e., this job is never going to run with this particular combination of user, queue, and resources). That may not be the right way to think about it, though; just sharing my mental model.
OK. Thanks. Good to know. I think what makes sense is to grow the table above with a clean taxonomy consisting of limit classes, limits, and handling semantics, and design/implement our solutions accordingly with "consistent behavior". Even in our discussions, there has been lots of confusion :-) Then we should be able to describe our system in terms of classes of limits instead of each individual limit; the latter is pretty ad hoc. At the end of the day, we may need an RFC. It seems the first set of handling semantics to emerge from our discussion is:
@dongahn: per your question on the coffee call about whether Slurm makes reservations for jobs that have exceeded a dynamic limit, the answer (AFAICT from digging through the source code) is no, they do not. In their backfill plugin, they check whether any dynamic limits are exceeded here and here, and it's not until a couple hundred lines later that they attempt any backfilling, starting here. So if we "skip" jobs once a user exceeds their dynamic limit, I believe our behavior will be in line with Slurm's.
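As a rough sketch of the "skip" semantics described above (hypothetical names; not Slurm's or qmanager's actual code), the key point is that an over-limit job is passed over without an allocation and without a reservation being held for it:

```python
def schedule_cycle(pending_jobs, running_per_user, max_running_per_user, try_alloc):
    """One scheduling pass: skip jobs whose user is over the dynamic limit."""
    for job in pending_jobs:
        if running_per_user.get(job.user, 0) >= max_running_per_user:
            # Skip: no allocation and, importantly, no reservation, so later
            # jobs can still be considered (and backfilled) this cycle.
            continue
        if try_alloc(job):  # hypothetical callback: allocate or reserve per policy
            running_per_user[job.user] = running_per_user.get(job.user, 0) + 1
```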
I don't know how to do this yet, but I thought it would be good to target it for day 1 if possible. A path: at this point, we don't pass user_id into

Edit: the initial posting was poorly worded. Sorry.
Yes, this would be a sane way to handle this. It would be very difficult to reason about the effect of a queuing policy when it is combined with some other "semi" scheduling like limits. And that was my fear.
Large function, that is...
IIUC, in terms of the handling of jobs with regard to fairshare, once a resource limit has been reached (

I'm not sure I have a good answer/suggestion as to how we want to handle this come day 1, but I thought I'd at least share what I know about Slurm's approach to handling it.
Thanks @cmoussa1: I believe you are referring to the following rows.
With respect to your comment:
Isn't this what we want in terms of how to handle this limit? Are you saying this will be done as part of fairshare? Maybe I'm missing something.
At this point, I am really wondering about the properties of fairshare changes. Does the fairshare of a user change in such a way that a user can monopolize the system? This is probably a function of a user's group and the jobs that are currently queued up. Under what conditions, then, can a user monopolize the system per his/her fairshare? Does this condition occur frequently? Maybe someone can build an analytic model or similar to reason about this space?
Just to address your question above in writing, @dongahn: no matter how many jobs a user submits, their priorities are always changing based on their usage. We talked two days ago about trying to design a set of static limits that could be used to generate job priorities. So I've started looking at some static limits that can be configured to generate an integer priority for a job. So far, this is what I've come up with:
These three factors would be used to calculate a job priority p:
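The equation itself isn't reproduced in this thread; as an illustrative sketch only (the weight names and linear form below are assumptions loosely modeled on Slurm's multifactor priority plugin, not the exact formula proposed), it might look like:

```python
def job_priority(size_factor, partition_factor, nice, w_size=1000, w_partition=1000):
    """Toy weighted-sum priority from the static factors described below.

    size_factor and partition_factor are assumed normalized to [0, 1];
    nice is >= 0 for unprivileged users and only ever lowers the result.
    """
    return int(w_size * size_factor + w_partition * partition_factor - nice)

# Example: a mid-sized job on a favored partition, submitted with nice=100
print(job_priority(size_factor=0.5, partition_factor=1.0, nice=100))  # 1400
```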
Each of the

Here's a little more explanation on the static factors that would be used in the priority calculation:

**Job Size**: This correlates to the number of nodes/CPUs the job has requested. This could be configured to favor larger jobs or smaller jobs.

**Partition**: Each node partition could be assigned an integer priority (e.g. the

**Nice**: Users can adjust the priority of their own jobs by setting a nice value on their jobs. Positive values negatively impact a job's priority. Only privileged users can specify a negative value. The higher the positive value, the more it negatively impacts the job's priority.

Other factors, like age and fairshare, are dynamic limits, so I did not include them in the equation above. There are other static factors, like a Trackable Resource (TRES for short) factor, where each TRES has its own factor for a job which represents the number of allocated/requested TRES of a given type in a given partition. There is also the QOS factor, which allows each QOS to be assigned an integer priority. The larger the number, the greater the job priority will be for jobs that request this QOS (similar to the partition factor).

IIRC, we weren't sure if we wanted to include both partitions and QOS's. An idea that I have (which is probably naive, but I figured I'd throw it out there) is to define a QOS-configurable attribute within a partition. For example, in Slurm right now, you could have a list of partitions (
Like I mentioned above, this is just me tossing around an idea, and I'm not entirely sure if it's feasible or well thought-out. I don't have any experience with submitting jobs with a QOS. I think this approach allows us to keep just one

Hopefully this made some sense and is at least a start for us to narrow down the user policy limits!
Thanks @cmoussa1 for the write up.
Is this nice factor the priority factor that is already implemented in the job-manager, or is it something separate? I ask because I always thought of the priority in job-manager as a sort of "priority class" where the highest priority jobs get serviced/considered first before lower class jobs are. Similar to how one of the network schedulers in Linux works [1]:
PS - Not to say that this is the "right way" to think about the nice priority; I just want to make sure I'm aligning my mental model with the discussion and the rest of the team's model.
To me, that sounds like how it works now, since jobs are ordered first by priority, then submit time. (If there is a difference between just sorting by priority, time and the WRR scheme described above, then I don't get it.)

Whether it stays that way depends on whether the current "submit priority" becomes an input factor in the priority output by a priority plugin, or whether a separate priority is generated and the job-manager continues to sort on submit priority first, then secondary priority (of which submit_time would presumably be a factor). (I realize after typing that this is obvious, but it helped to type it up.)

It seems there are benefits to either approach. If we keep the primary (submit) priority as a separate, primary sort key, then things like job hold and expedite could be more easily implemented as a simple adjustment of this one priority. If the submit priority is just one factor in the final priority, though, then I think it does satisfy the use case for a "nice" value, since it is already adjustable by the user (lower only). However, then we'd need a different method for hold/expedite.

If we keep the primary priority, I can't imagine a use case for 32 "priority classes", where the jobs of each class always run before all other jobs of the lower classes. This would mean that, in a typical case, if a user submitted a job with a priority 1 lower than the default, it would run after all other jobs on the system...
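To make the two orderings concrete, here is a toy sketch (hypothetical field names; not flux-core's actual job-manager code) of (a) keeping the submit priority as a separate primary sort key versus (b) folding it into a single combined priority:

```python
from dataclasses import dataclass

@dataclass
class Job:
    jobid: int
    submit_priority: int      # set at submit time, user-adjustable downward
    plugin_priority: float    # e.g. output of a priority plugin
    t_submit: float

def order_two_key(jobs):
    # (a) submit priority is the primary key; plugin priority and submit time break ties.
    return sorted(jobs, key=lambda j: (-j.submit_priority, -j.plugin_priority, j.t_submit))

def order_combined(jobs, w_submit=1000.0):
    # (b) submit priority is just one weighted factor of a single final priority;
    # hold/expedite would then need a different mechanism.
    return sorted(jobs, key=lambda j: (-(w_submit * j.submit_priority + j.plugin_priority), j.t_submit))
```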
Would this mean that a nice factor would be left out of a final priority calculation? If so, as far as static limits go, I think that would leave the following:
Would a primary sort key adjust job priorities after a job's initial priority p is calculated?
Thank you for furthering this discussion, and sorry I'm coming to this late.
My apologies. I got a bit confused. Do the static limits affect the job priority calculation beyond "if your job exceeds a limit, your job will be rejected or not be scheduled"? It seems you are referring to static priority factors?
Yeah, considering those two properties as the static factors of priority calculation makes sense to me. One nit: we use the notion of multiple "queues" instead of "partitions", as this term conveys the concept of overlapping resource sets a bit better. So my preference is to use "queues" for this as well.
Is the proposal to use this as one of the sorting criteria at the

I had to think through it, but it may work at that level. Even if the jobs are sorted at

In my initial mental model of how the jobs flow through multiple queues and are sorted:
At least with the static priority factors, whether we do that sorting at the job-manager level or at the

Two issues:
@cmoussa1: somehow this ticket has morphed from a policy limit discussion into a static priority factor discussion. Perhaps we should move the priority factor piece into a new ticket. I don't mind if this goes to flux-accounting?
I don't believe they do. I guess the accounting side only contains static limits, not static priority factors. As of now, flux-accounting contains:
If, on submission, at least one of these static limits is exceeded, then a user's job will not be scheduled until all three limits are satisfied.
Understood, my mistake!
I can move this discussion to flux-framework/flux-accounting #9.
This issue is broader than the topic implies, but now that we have nailed down
It seems like these problems are less wide-ranging now; we could open narrower-scope issues in core, accounting, and sched as needed. (Please reopen if I'm mistaken.)
Related to #286, #287, #529, and #637.
As it stands now with Slurm and other schedulers, you can set limits on things like walltime, job size, the simultaneous number of submitted/running jobs (controlled independently), and the simultaneous number of nodes used, at the granularity of a user, a partition/queue, or a QoS (i.e., job(s) with a particular label applied).
We need to decide A) which of these limits we want to support and B) where we want to enforce them.
#637 lays out our current plans for implementing multiple partitions/queues and implicit in that is controlling the number of nodes assigned to each partition/queue.
I'm not sure if we've documented it anywhere, but we can pretty easily handle the walltime and job size limits at the job-ingest validator. Note: if we enable modifying jobspec post-submission, we will need to re-validate. If anyone disagrees, we should open a separate issue to discuss it.
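As a rough sketch of such an ingest-time check (the limit values and the validator interface shown are assumptions, not flux-core's actual validator plugin API; the jobspec fields follow the canonical RFC 14 layout):

```python
MAX_DURATION = 24 * 3600   # assumed site limit, seconds
MAX_NODES = 512            # assumed site limit

def validate(jobspec):
    """Return an error string if the jobspec exceeds a static limit, else None."""
    duration = jobspec.get("attributes", {}).get("system", {}).get("duration", 0)
    if duration > MAX_DURATION:
        return f"requested duration {duration}s exceeds limit of {MAX_DURATION}s"
    nnodes = sum(v.get("count", 1) for v in jobspec.get("resources", [])
                 if v.get("type") == "node")
    if nnodes > MAX_NODES:
        return f"requested {nnodes} nodes exceeds limit of {MAX_NODES}"
    return None  # accept
```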
At the end of #529, it sounds like we want to avoid QoS if possible and instead leverage overlapping queues/partitions. I believe the idea there is to have many queues, each with their own limits, and then when you want to emulate the `expedite` QoS, you would move the job from the `batch` queue to the `expedite` queue (if we want to go that route, we should open a ticket to track it, since that will require editing jobspec or a new interface to qmanager).

So the gaps that I see are:

- … `debug` (and ignore `batch`).
- … `debug`. Not sure about other use-cases.

All of those gaps look to be for controlling the behavior of a "bad actor" (someone who could be trying to exploit the `debug` queue for production work or could be unaware that they are submitting to the wrong queue, too many jobs, etc.). So ultimately, I think this is a relatively low priority item to get implemented during our rollout, but I wanted to bring this up so we at least have it in mind as we design other policy limiting functionality.