Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent ClusterQueue with pending workloads to block borrowing in other flavors in the cohort #1036

Closed
3 tasks
alculquicondor opened this issue Aug 2, 2023 · 1 comment · Fixed by #1039
Closed
3 tasks
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@alculquicondor
Copy link
Contributor

What would you like to be added:

In #475 and #805, we guaranteed the following:
If a ClusterQueue has pending workloads that could fit in the nominalQuota if other workloads finish (regardless of whether preemption is enabled), we don't admit other workloads in the cohort that can fit by borrowing. This allows the pending workloads to be admitted as soon as the quota becomes available without needing to trigger extra preemptions (or waiting for more jobs to finish, if preemption is disabled).

However, the check is too coarse: if the borrowing is happening in a different flavor, we shouldn't block it.

A possible solution is to track not just which cohorts already had workloads admitted in a cycle, but also which combinations of (flavor, resource).

Current check per cohort:

if cq.Cohort != nil {
// Having more than one workloads from the same cohort admitted in the same scheduling cycle can lead
// to over admission if:
// 1. One of the workloads is borrowing, since during the nomination the usage of the other workloads
// evaluated in the same cycle is not taken into account.
// 2. An already admitted workload from a different cluster queue is borrowing, since all workloads
// evaluated in the current cycle will compete for the resources that are not borrowed.
if usedCohorts.Has(cq.Cohort.Name) && (e.assignment.Borrows() || cq.Cohort.HasBorrowingQueues()) {
e.status = skipped
e.inadmissibleMsg = "other workloads in the cohort were prioritized"
continue
}
// Even if there was a failure, we shouldn't admit other workloads to this
// cohort.
usedCohorts.Insert(cq.Cohort.Name)
}

Why is this needed:

As users start setting up complex ClusterQueues, with multiple flavors and multiple CQs per cohort, the current check doesn't match user expectations, specially if low priority pending workloads in a ClusterQueue could end up blocking other ClusterQueue.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@alculquicondor alculquicondor added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 2, 2023
@alculquicondor
Copy link
Contributor Author

/assign @trasc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants