Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we preempt in more than one CQ per cohort in a cycle? #2596

Closed
Tracked by #2628
alculquicondor opened this issue Jul 12, 2024 · 5 comments · Fixed by #2641
Closed
Tracked by #2628

Can we preempt in more than one CQ per cohort in a cycle? #2596

alculquicondor opened this issue Jul 12, 2024 · 5 comments · Fixed by #2641
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@alculquicondor
Copy link
Contributor

Note: opening this more of as a discussion, rather than something that can be worked on right away.

If a CQ needs a preemption, and such preemption is taking a long time (because of the grace period), then this blocks other preemptions in the same cohort:

// Check whether there was an assignment in this cycle that could render the next assignments invalid:
// - If the workload no longer fits in the cohort.
// - If there was another assignment in the cohort, then the preemption calculation is no longer valid.
if cycleCohortsUsage.hasCommonFlavorResources(cq.Cohort.Name, e.assignment.Usage) &&
((mode == flavorassigner.Fit && !cq.FitInCohort(sum)) ||
(mode == flavorassigner.Preempt && cycleCohortsSkipPreemption.Has(cq.Cohort.Name))) {

Can we take a more optimistic approach and allow other CQs to preempt?

A problematic scenario could be as follows:

  • Workload a from CQ A needs to preempt x, y.
  • Workload b from CQ B needs to preempt y and z.

Now x, y and z are successfully preempted, so Workload a can fit. But workload b cannot longer fit, because a took the space of y. There are 2 possibilities:

  • b can preempt something else, in which case the situation should resolve by itself in the next iterations.
  • b can't preempt anything else, in which case we preempted z unnecessarily.
@alculquicondor alculquicondor added the kind/bug Categorizes issue or PR as related to a bug. label Jul 12, 2024
@alculquicondor
Copy link
Contributor Author

alculquicondor commented Jul 12, 2024

Maybe we can allow more preemptions, as long as each workload is preempting different workloads. But my gut feeling is that, in most scenarios, multiple CQs will be trying to preempt the same workloads.

Another avenue could be to combine the above suggestion with excluding terminating workloads from the preemption target calculations. This way, each CQ would try to preempt workloads that were not chosen by other CQs already. But we could be at risk of doing unnecessary preemptions if one of the workloads that is stuck in termination is big enough to make space for multiple workloads that needed preemption.

@alculquicondor
Copy link
Contributor Author

cc @gabesaba

@alculquicondor
Copy link
Contributor Author

Maybe we can allow more preemptions, as long as each workload is preempting different workloads.

Let's start with this, as it can help in the cases when multiple CQs are preempting their own workloads.

/assign @gabesaba

@gabesaba
Copy link
Contributor

Maybe we can allow more preemptions, as long as each workload is preempting different workloads.

Let's start with this, as it can help in the cases when multiple CQs are preempting their own workloads.

/assign @gabesaba

I think that there is a problematic case even when the sets of workloads preempted are disjoint - suppose both CQs need some remaining FlavorResource capacity, plus whatever capacity is reclaimed from the preemption. If the sum of FlavorResource capacity needed by both CQs is more than the Cohort has available, then only one can be admitted, and we did a wasteful preemption.

@gabesaba
Copy link
Contributor

Maybe we can allow more preemptions, as long as each workload is preempting different workloads.

Let's start with this, as it can help in the cases when multiple CQs are preempting their own workloads.
/assign @gabesaba

I think that there is a problematic case even when the sets of workloads preempted are disjoint - suppose both CQs need some remaining FlavorResource capacity, plus whatever capacity is reclaimed from the preemption. If the sum of FlavorResource capacity needed by both CQs is more than the Cohort has available, then only one can be admitted, and we did a wasteful preemption.

handled this case in #2641

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants