Add "Frozen" ClusterQueue state #134
Comments
/assign |
ToDoList: |
We only need to wait for running workloads to finish. We can do this as follows: 1) change the CQ status to suspended, which prevents any new workloads from being admitted; 2) wait until the existing admitted workloads finish.
The admin will have to delete admitted running workloads manually; I don't think we can do that on their behalf.
I think this should be the first PR, or you can merge them together.
I think we need only one state. I would change pending to suspended and use different reasons based on why the CQ is in that state.
It will be based on the logic I described above: we don't care about workloads that are not yet admitted. Once the CQ is actually deleted, their status will be updated to say that the CQ doesn't exist, and users can decide what to do with them. |
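For illustration, the two-step flow described above (suspend first, then wait for admitted workloads to drain) can be sketched as a small toy model. This is Python rather than Kueue's Go, and every name in it (`ClusterQueue`, `request_delete`, the `"Terminating"` reason) is hypothetical; it is a sketch of the idea under those assumptions, not Kueue's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterQueue:
    """Toy model of a ClusterQueue; all names here are hypothetical."""
    name: str
    status: str = "Active"   # "Active" or "Suspended"
    reason: str = ""         # why the queue is suspended
    admitted: set = field(default_factory=set)
    pending: set = field(default_factory=set)
    deleted: bool = False

    def submit(self, workload: str) -> None:
        # Workloads start out pending (not yet admitted).
        self.pending.add(workload)

    def admit(self, workload: str) -> bool:
        # Step 1: a suspended queue admits nothing new.
        if self.status != "Active" or workload not in self.pending:
            return False
        self.pending.remove(workload)
        self.admitted.add(workload)
        return True

    def finish(self, workload: str) -> None:
        # Step 2: each finished workload may unblock the deletion.
        self.admitted.discard(workload)
        self._maybe_finalize()

    def request_delete(self) -> None:
        # Deletion only suspends the queue; running workloads are untouched.
        self.status = "Suspended"
        self.reason = "Terminating"
        self._maybe_finalize()

    def _maybe_finalize(self) -> None:
        # Only admitted workloads block deletion; pending ones do not.
        if self.reason == "Terminating" and not self.admitted:
            self.deleted = True
```

In this model, pending workloads never block the deletion: they simply remain in `pending` after the queue is gone, which mirrors the point that users decide what to do with them afterwards.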
Note that the Assumed state is temporary. A workload is assumed when the scheduler decides that it should fit and then it's Forgotten if the API call fails. So it's completely fine to wait for these workloads to either be admitted and finished or to get removed from the clusterqueue. In all scenarios, all that should matter is whether the cache's ClusterQueue is empty. |
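To make the Assumed/Forgotten lifecycle concrete, here is a minimal sketch of a cache whose only deletion criterion is "is it empty?". It is written in Python as a toy model, not in Kueue's Go, and the class and method names (`ClusterQueueCache`, `assume`, `forget`, `confirm_admission`, `is_empty`) are assumptions made for this example rather than Kueue's API.

```python
class ClusterQueueCache:
    """Toy per-ClusterQueue cache; names are hypothetical."""

    def __init__(self) -> None:
        self.assumed = set()    # scheduler decided these should fit
        self.admitted = set()   # admission API call succeeded

    def assume(self, workload: str) -> None:
        # The scheduler optimistically reserves room before the API call.
        self.assumed.add(workload)

    def forget(self, workload: str) -> None:
        # The API call failed, so the reservation is dropped.
        self.assumed.discard(workload)

    def confirm_admission(self, workload: str) -> None:
        # The API call succeeded: assumed becomes admitted.
        self.assumed.discard(workload)
        self.admitted.add(workload)

    def finish(self, workload: str) -> None:
        self.admitted.discard(workload)

    def is_empty(self) -> bool:
        # The single condition that matters for deletion in this model.
        return not self.assumed and not self.admitted
```

Because an assumed workload always ends up either forgotten or admitted and later finished, waiting for `is_empty()` covers every scenario without treating the temporary Assumed state specially.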
The admin is aware of the status of all the workloads, so when the admin wants to delete the ClusterQueue, the admin should be responsible for the results (we will only wait for the admitted workloads to complete). Conversely, it doesn't make sense for the deletion to be stuck in terminating because several unadmitted workloads are pending for special reasons. Of course the admin can delete the unadmitted workloads manually, but with hundreds of unadmitted workloads that is a struggle. |
What are you proposing? That Kueue deletes running workloads? If so, I don't think we can do that. We should be conservative when handling user workloads; deleting the workloads is super aggressive and may not be what the admin/user wants.
A tool can be created for that; we plan to have a kubectl plugin for Kueue, and such a feature can be created there. |
My opinion is the same as yours actually. |
With #284 merged, I think this one is done. We still need to add the finalizers in the webhook, though; that work is now tracked in separate issues. /close |
@ahg-g: Closing this issue. |
Introduce a new "OutOfCommission/Frozen" state for ClusterQueue. A ClusterQueue can enter this state in the following cases:
ClusterQueues in this state should be taken out of the cohort, and no jobs can be scheduled through them until the referenced flavors are defined.