From 43e5eb8ee19fe69d3dd3f03c13f4ad46b77dd094 Mon Sep 17 00:00:00 2001 From: Marcin Wielgus Date: Wed, 1 May 2024 20:40:55 +0200 Subject: [PATCH] KEP 2076: Kueuectl (#2093) * KEP 2076: Kueuectl * Apply suggestions from code review Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com> Co-authored-by: Yuki Iwai * Fixed review comments --------- Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com> Co-authored-by: Yuki Iwai --- keps/2076-kueuectl/README.md | 347 +++++++++++++++++++++++++++++++ keps/2076-kueuectl/kep.yaml | 28 +++ keps/487-kubectl-plugin/kep.yaml | 2 +- 3 files changed, 376 insertions(+), 1 deletion(-) create mode 100644 keps/2076-kueuectl/README.md create mode 100644 keps/2076-kueuectl/kep.yaml diff --git a/keps/2076-kueuectl/README.md b/keps/2076-kueuectl/README.md new file mode 100644 index 0000000000..075aaeedd3 --- /dev/null +++ b/keps/2076-kueuectl/README.md @@ -0,0 +1,347 @@ +# KEP-2076: Kueuectl + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Create ClusterQueue](#create-clusterqueue) + - [Create LocalQueue](#create-localqueue) + - [List ClusterQueue](#list-clusterqueue) + - [List LocalQueue](#list-localqueue) + - [List Workloads](#list-workloads) + - [Stop ClusterQueue](#stop-clusterqueue) + - [Resume ClusterQueue](#resume-clusterqueue) + - [Stop LocalQueue](#stop-localqueue) + - [Resume LocalQueue](#resume-localqueue) + - [Stop Workload](#stop-workload) + - [Resume Workload](#resume-workload) + - [Pass-through](#pass-through) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit Tests](#unit-tests) + - [Integration tests](#integration-tests) + - [Graduation Criteria](#graduation-criteria) +- [Implementation History](#implementation-history) +- [Alternatives](#alternatives) + + +## Summary + +We want to create a command line tool for Kueue that allows to: + +* list Kueue's objects with easy to use Kueue-specific filtering, +* create Local and ClusterQueues without writing yamls, +* perform management operations on LQs, CQs, Workloads, ResourceFlavors and other Kueue objects. + +## Motivation + +Currently many administrative operations around Kueue are largely inconvenient. +They require full API understanding, are relatively error-prone or are simply +tedious without writing a custom mini script or complex pipe processing. + +### Goals + +* Provide a command line tool for system administrator to: + + * Create ClusterQueues and LocalQueues. + * Listing Queues and Workloads that meet certain criteria. + * Stopping and resuming execution in ClusteQueues and LocalQueues. + * Stopping and resuming individual Workloads. + * (In the future) Migrating workloads between LocalQueues and other avanced operations + +* Build it on top of kubectl (as a [kubectl plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/)) to reuse all of +the authentication/cluster selection methods. + +* Expose the plugin in Krew. + +### Non-Goals + +* Provide any other interface than a command line (no web ui). +* Provide a tool targeted at ml researchers to make running jobs on Kubernetes easier. +* Expose additional metrics or statuses. + +## Proposal + +Create kueue kubectl plugin with a set of listing and management commands in form of: + +``` +kubectl kueue +``` + +Additionally provide a wrapper script to allow shorter syntax like: + +``` +kueuectl +``` + +The commands automatically submit all changes unless `--dry-run` option is given - in that +case the tool will print out yamls without making any changes in the cluster (we may +submit the yaml with dryrun enabled to the APIServer to get validation error). + +### User Stories (Optional) + +#### Story 1 + +I want to stop admission of new Workloads on a specific ClusterQueue but allow the already +running Workloads to complete without manually editing the CQ's definition. + +#### Story 2 + +I want to create a LocalQueue pointing to a specific CQ without creating a one-time yaml. + +### Risks and Mitigations + +* There will be an additional binary placed on sysadmin's machine, whose version will +have to be keep in sync. + +## Design Details + +The following commands will be provided within the plugin: + +### Create ClusterQueue + +Creates a ClusterQueue with the given name, cohort, specified quota and other details. +Format: + +``` +kueuectl create cq|clusterqueue cqname +–-cohort=cohortname # defaults to "" - no cohort + +--queuing-strategy=strategy # defaults to BestEffortFIFO +--namespace-selector=selector # defaults to {} - all namespaces can use the queue +--reclaim-within-cohort=policy # defaults to Never +--preemption-within-cluster-queue = policy # defaults to Never + +–-nominal-quota=rfname1:resource1=value,resource2=value,resource3=value +–-borrowing-limit=rfname1:resource1=value,resource2=value,resource3=value +–-lending-limit=rfname1:resource1=value,resource2=value,resource3=value +``` + +It is possible to create a ClusterQueue with multiple resource flavors/FlavorQuotas inside +a ResourceGroup and multiple ResourceGroups covering different sets of resources. +The command will create the appropriate resource groups with resource flavors +in the order they appear in the command line. If two settings have at least +one common resource quota specified, they will land in the same ResourceGroup. + +Output: +A simple confirmation, like in regular kubectl create, +`clusterqueue.kueue.x-k8s.io/xxxxx created` + + +### Create LocalQueue + +Creates a LocalQueue with the given name pointing to specified ClusterQueue. +The command validates that the target ClusterQueue exists and +its namespace selector matches to LocalQueue's namespace. + +Format: + +``` +kueuectl create lq|localqueue lqname +–-namespace=namespace # uses context's default namespace if not specified +–-clusterqueue=cqname +--ignore-unknown-cq +``` + +Output: +A simple confirmation `localqueue.kueue.x-k8s.io/xxxxx created` + +### List ClusterQueue + +List all ClusterQueues, potentially limiting output to those that are active/inactive and +matching the label selector. Format: + +``` +kueuectl list cq|clusterqueue(s) +--active=*|true|false +--selector=selector # label selector +``` + +Output columns: + +* Name +* Cohort +* Pending Workloads +* Admitted Workloads +* Status (active or not) +* Age + +### List LocalQueue + +Lists LocalQueues that match the given criteria: point to a specific CQ, +being active/inactive, belonging to the specified namespace or matching the label +selector. +Format: + +``` +kueuectl list lq|localqueue(s) +–-namespace=ns # uses context's default namespaces if not specified +--all-namespaces | -A +-–clusterqueue=clusterqueue +–-active=*|true|false +--selector=selector # label selector +``` + +Outputs columns: + +* Namespace (if -A is used) +* Name +* ClusterQueue +* Pending Workloads +* Admitted Workloads +* Status (active or not) +* Age + +### List Workloads + +Lists Workloads that match the provided criteria. Format: + +``` +kueuectl list workloads +--namespace=ns # uses context's default namespace if not specified +--all-namespaces | -A +--clusterqueue=cq +-–localqueue=lq +-—only-pending +—-only-admitted +--selector=selector +``` + +Output: + +* Namespace (if -A is used) +* Workload name +* CRD type (truncated to 10 chars) +* CRD name +* LocalQueue +* ClusterQueue +* Status +* Position in Queue (if Pending) +* Age + + +### Stop ClusterQueue + +Stops admission and execution inside the specified ClusterQueue, possibly +limiting the action only to the selected ResourceFlavor. +Format: + +``` +kueuectl stop clusterqueue|cq cqname +--keep-already-running +--resource-flavor=rfnam # Requires additional API. +``` + +Output: +None. + +### Resume ClusterQueue + +Resumes admission inside the specified ClusterQueue. +Format: +``` +kueuectl resume clusterqueue|cq cqname +-–resource-flavor=rfname # Requires additional API. +``` +Output: +None. + +### Stop LocalQueue + +Stops execution (or just admission) of Workloads coming from the given LocalQueue. +This requires adding StopPolicy to LocalQueue and enforcing its changes in ClusterQueue (#2109). +Format: +``` +kueuectl stop localqueue|lq lqname +--keep-already-running +``` +Output: +None. + + +### Resume LocalQueue + +Resumes admission of Workloads coming from the given LocalQueue. +Format: +``` +kueuectl resume localqueue|lq lqname +``` +Output: +None. + +### Stop Workload + +Puts the given Workload on hold. The Workload will not be admitted and +if it is already admitted it will be put back to queue just as if it was preempted +(using `.spec.active` field). + +Format: +``` +kueuectl stop workload name --namespace=ns +``` +Output: +None. + + +### Resume Workload + +Resumes the Workload, allowing its admission according to regular ClusterQueue rules. +Format: +``` +kueuectl resume workload name --namespace=ns +``` +Output: +None. + +### Pass-through + +For completeness there will be 4 additional commands that will simply execute regular kubectl +so that the users won't have to remember to switch the command to kubectl. + +* `delete workload|clusterqueue|cq|localqueue|lq` +* `get workload|clusterqueue|cq|localqueue|lq` +* `edit workload|clusterqueue|cq|localqueue|lq` +* `describe workload|clusterqueue|cq|localqueue|lq` + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +#### Unit Tests + +Regular unit tests for the commands will be provided. + +#### Integration tests + +Integration/E2E tests will be provided for all of the commands. + +### Graduation Criteria + +Beta: +* Positive feedback from users. +* All bugs and issues fixed. + +GA/Stable: +* Positive feedback from users +* No request for column/flags changes for 0.5 year. + +## Implementation History + +KEP: 2023-04-27. + +## Alternatives + +* Use existing kubectl functionality and perform management operations via +API manipulations. +* Don't use kubectl plugins but write CLI from scratch. \ No newline at end of file diff --git a/keps/2076-kueuectl/kep.yaml b/keps/2076-kueuectl/kep.yaml new file mode 100644 index 0000000000..a0e56ab5ec --- /dev/null +++ b/keps/2076-kueuectl/kep.yaml @@ -0,0 +1,28 @@ +title: Kueuectl +kep-number: 2076 +authors: + - "@mwielgus" +status: implementable +creation-date: 2024-04-25 +reviewers: + - "@alculquicondor" +approvers: + - "@alculquicondor" +replaces: + - "/keps/487-kubectl-plugin" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v0.8" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v0.8" + beta: "v0.9" + +disable-supported: false + diff --git a/keps/487-kubectl-plugin/kep.yaml b/keps/487-kubectl-plugin/kep.yaml index b090acedf5..82a3f99420 100644 --- a/keps/487-kubectl-plugin/kep.yaml +++ b/keps/487-kubectl-plugin/kep.yaml @@ -2,7 +2,7 @@ title: Kubectl plugin for listing objects kep-number: 487 authors: - "@vsoch" -status: provisional +status: replaced # Replaced by KEP #2076 creation-date: 2023-07-11 reviewers: - "@tenzen-y"