-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc: set the deregistration eval priority to the job priority. #11426
Conversation
Previously when creating an eval for job deregistration, the eval priority was set to the default value irregardless of the job priority. In situations where an operator would want to deregister a high priority job so they could re-register; the evaluation may get blocked for some time on a busy cluster because of the deregsiter priority. If a job had a lower than default priority and was deregistered, the deregister eval would get a priority higher than that of the job. If we attempted to register another job with a higher priority than this, but still below the default, the deregister would be actioned before the register. Both situations described above seem incorrect and unexpected from a user prespective. This fix modifies to behaviour to set the deregister eval priority to that of the job, if available. Otherwise the default value is still used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But I'm also wondering if we should set a floor on deregistration priority to ensure that jobs are getting removed on a busy cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change LGTM, and it does what it says in the tin 😄 , but I wonder if we're conflating three different priority values into one, not just on this PR and in general.
There's the job's priority, which is used for preemption, then the job registration eval priority and the job deregistration priority, which are use the eval broker queue order.
The problem I see in mixing these 3 is that the situation you described would only happen if job priorities were already being used, which is not always the case as preemption is not something everyone needs to think about.
So if all jobs don't have a priority set, they would need to modify the jobspec, re-register it, and then stop it, potentially affecting preemption and introducing more load into the system before being able to finally reduce it.
I also think that the scheduler and the eval broker interpret priorities in different ways. For the scheduler, a high priority job should be "hard" to stop, meaning that the run
and stop
priority are the inverse.
For the eval broker, the job registration priority shouldn't really impact the job deregistration priority since they are two somewhat isolated operations. The priority order is tied to the type of operation being performed. So a EvalTriggerJobDeregister
should have a higher priority relative than a EvalTriggerJobRegister
eval because you usually want to be able to reduce load before adding more. So maybe EvalTriggerJobDeregister
should always be 100
? Tough that sounds a bit drastic 😅 Or maybe job.Priority + 5
?
But I think the real solution would be to allow operators to set eval priority independently of job priority. For example, add an -eval-priority
flag to nomad job run
and nomad job stop
to override the default values.
tl;dr - Let's merge this. I don't think it will have a significant positive impact, but I do think it brings some behavior more in line with user expectations. Due to https://github.com/hashicorp/nomad/blob/v1.1.6/nomad/eval_broker.go#L865-L873 all pending evals for the same job are ordered by I think this PR does still help in the general sense of
This is a great question. The only case I can think of for not hardcoding de-registrations to P=100 is to prevent DoSing or priority inversion. Say a single cluster runs both low-priority batch and mixed-priority service workloads with a few P=99 services constrained to a small set of nodes that batch workloads aren't allowed on. I think it's easy to imagine the low-priority batch jobs are submitted by code rather than a human, perhaps a CI/CD system or email sending. If due to bug or backpressure or operator intervention the batch jobs aren't completing quickly and need to be stopped, Nomad could receive a flood of low-priority de-registrations. If the P=99 service also needs updating, perhaps to react to the same systemwide incident, its evals are suddenly "stuck" behind the batch de-regs even though the batch de-regs can't affect anything related to the P=99 service's scheduling due to constraints! I think this is the pathological case (please try to come up with a worse one!). On the plus side processing evals for stopped jobs is way less work than for pending jobs so why not try to do stopped first in all but the most pathological DoS situations?
I agree with this. While arbitrarily bumping priority of de-reg evals is an interesting idea to try to get stopped jobs processed first, I think this is a much safer approach. Let operators "skip the line" when desired, but otherwise use the logic proposed in this PR because it is intuitively what users will expect. I'll open an issue for that enhancement. |
…rrectly rpc: set the deregistration eval priority to the job priority.
…rrectly rpc: set the deregistration eval priority to the job priority.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Previously when creating an eval for job deregistration, the eval
priority was set to the default value irregardless of the job
priority. In situations where an operator would want to deregister
a high priority job so they could re-register; the evaluation may
get blocked for some time on a busy cluster because of the
deregsiter priority.
If a job had a lower than default priority and was deregistered,
the deregister eval would get a priority higher than that of the
job. If we attempted to register another job with a higher
priority than this, but still below the default, the deregister
would be actioned before the register.
Both situations described above seem incorrect and unexpected from
a user perspective.
This fix modifies to behaviour to set the deregister eval priority
to that of the job, if available. Otherwise the default value is
still used. This is the same behaviour found within
BatchDeregister
.