The multi-cluster-app-dispatcher (MCAD) is a Kubernetes controller providing mechanisms for applications to manage batch jobs in a single or multi-cluster environment. For more details, please refer to the multi-cluster-app-dispatcher project.
MCAD allows you to deploy a Ray cluster with a guarantee that sufficient resources are available prior to actual pod creation in the Kubernetes cluster. It supports features such as:
- Integration with the upstream Kubernetes scheduling stack for features such as co-scheduling, packing on the GPU dimension, etc.
- Ability to wrap any Kubernetes objects.
- Increases control plane stability by JIT (Just-in Time) object creation.
- Queuing with policies.
- Quota management that goes across namespaces.
- Support for multiple Kubernetes clusters, dispatching jobs to any one of a number of Kubernetes clusters.
To queue RayCluster(s) and gang-dispatch them when aggregated resources are available, first create a KinD cluster using the instructions below, then refer to the KubeRay-MCAD integration setup for a Kubernetes cluster or an OpenShift cluster.
On OpenShift, MCAD and KubeRay are already part of the Open Data Hub Distributed Workload Stack. The stack provides a simple, user-friendly abstraction for scaling, queuing, and resource management of distributed AI/ML and Python workloads. Please follow the Quick Start in the Distributed Workloads documentation for installation.
To consistently observe the expected behavior described in the demo below, we need a KinD cluster with the resources specified here. This can be done by running KinD with Podman.
Note: Without Podman, a KinD worker node can see all of the host's CPU and memory resources. This environment is intended only for running the tutorial on a resource-constrained local Kubernetes cluster; it is not recommended for real workloads or production.
podman machine init --cpus 8 --memory 8196
podman machine start
podman machine list
Expect the Podman machine to be running with the following CPU and MEMORY resources:
NAME VM TYPE CREATED LAST UP CPUS MEMORY DISK SIZE
podman-machine-default* qemu 2 minutes ago Currently running 8 8.594GB 107.4GB
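If a default Podman machine already exists with different resources, it may need to be stopped and removed before re-running the init command above. A minimal sketch (the machine name matches the default shown above; adjust to your environment):
podman machine stop podman-machine-default
podman machine rm podman-machine-default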
Create KinD cluster on the Podman Machine:
KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster
Creating a KinD cluster should take less than a minute. Expect output similar to:
using podman due to KIND_EXPERIMENTAL_PROVIDER
enabling experimental podman provider
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.26.3) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Have a nice day! 👋
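The node image version in the output above may differ depending on your installed kind release. If you want to pin it to the version shown here, kind supports specifying the node image explicitly; a sketch, assuming the tag from the output above:
KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster --image kindest/node:v1.26.3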
Describe the single node cluster:
kubectl describe node kind-control-plane
Expect the cpu and memory in the Allocatable section to be similar to:
Allocatable:
cpu: 8
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8118372Ki
pods: 110
After the KinD cluster is created using the instructions above, make sure to install the KubeRay-MCAD integration prerequisites for the KinD cluster.
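The prerequisites cover installing both the MCAD controller and the KubeRay operator. As a rough sketch, the KubeRay operator can typically be installed with Helm (the chart repository, chart name, and namespace below are assumptions; follow the linked prerequisites for the exact steps used with MCAD):
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --create-namespace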
Let's create two RayClusters using the AppWrapper custom resource (CR) on the same Kubernetes cluster. The AppWrapper is the custom resource definition provided by MCAD to dispatch resources and manage batch jobs on Kubernetes clusters.
- We submit the first RayCluster with the AppWrapper CR aw-raycluster.yaml:
kubectl create -f https://raw.githubusercontent.com/project-codeflare/multi-cluster-app-dispatcher/main/doc/usage/examples/kuberay/config/aw-raycluster.yaml
In the above AppWrapper CR, we wrapped an example RayCluster CR in the generictemplate. We also specified matching resources for the RayCluster head node and worker node in the custompodresources. MCAD uses the custompodresources to reserve the required resources to run the RayCluster without creating pending Pods.
Note: Within the same AppWrapper, you may also wrap any individual Kubernetes resources (e.g. ConfigMap, Secret) associated with this job as a generictemplate to be dispatched together with the RayCluster.
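For reference, an abridged sketch of the AppWrapper layout follows. Field names and values here are illustrative only; the API group, resource amounts, and the full RayCluster spec should be taken from the linked aw-raycluster.yaml:
apiVersion: workload.codeflare.dev/v1beta1   # API group assumed; check the linked YAML
kind: AppWrapper
metadata:
  name: raycluster-complete
  namespace: default
spec:
  resources:
    GenericItems:
    - replicas: 1
      # custompodresources: what MCAD reserves before any Pods are created
      custompodresources:
      - replicas: 1            # head Pod (illustrative values)
        requests:
          cpu: 2
          memory: 8G
        limits:
          cpu: 2
          memory: 8G
      - replicas: 1            # worker Pod (illustrative values)
        requests:
          cpu: 1
          memory: 2G
        limits:
          cpu: 1
          memory: 2G
      # generictemplate: the wrapped RayCluster CR
      generictemplate:
        apiVersion: ray.io/v1alpha1
        kind: RayCluster
        metadata:
          name: raycluster-complete
        spec:
          # ... head group and worker group specs elided ...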
Check AppWrapper status by describing the job.
kubectl describe appwrapper raycluster-complete -n default
The Status: stanza would show the State of Running if the wrapped RayCluster has been deployed.
Status:
  Canrun:  true
  Conditions:
    Last Transition Micro Time:  2023-08-29T02:50:18.829462Z
    Last Update Micro Time:      2023-08-29T02:50:18.829462Z
    Status:                      True
    Type:                        Init
    Last Transition Micro Time:  2023-08-29T02:50:18.829496Z
    Last Update Micro Time:      2023-08-29T02:50:18.829496Z
    Reason:                      AwaitingHeadOfLine
    Status:                      True
    Type:                        Queueing
    Last Transition Micro Time:  2023-08-29T02:50:18.842010Z
    Last Update Micro Time:      2023-08-29T02:50:18.842010Z
    Reason:                      FrontOfQueue.
    Status:                      True
    Type:                        HeadOfLine
    Last Transition Micro Time:  2023-08-29T02:50:18.902379Z
    Last Update Micro Time:      2023-08-29T02:50:18.902379Z
    Reason:                      AppWrapperRunnable
    Status:                      True
    Type:                        Dispatched
  Controllerfirsttimestamp:  2023-08-29T02:50:18.829462Z
  Filterignore:              true
  Queuejobstate:             Dispatched
  Sender:                    before manageQueueJob - afterEtcdDispatching
  State:                     Running
Events:                      <none>
The two Pods associated with the RayCluster are also created:
kubectl get pod -n default
NAME                                           READY   STATUS    RESTARTS   AGE
raycluster-complete-head-9s4x5                 1/1     Running   0          47s
raycluster-complete-worker-small-group-4s6jv   1/1     Running   0          47s
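For a quicker check than the full describe output, the state can also be queried directly with a jsonpath expression (assuming the AppWrapper status exposes the State shown above as .status.state):
kubectl get appwrapper raycluster-complete -n default -o jsonpath='{.status.state}'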
- Let's submit another RayCluster with the AppWrapper CR and see it queued without creating pending Pods, using the command:
kubectl create -f https://raw.githubusercontent.com/project-codeflare/multi-cluster-app-dispatcher/main/doc/usage/examples/kuberay/config/aw-raycluster-1.yaml
Check the raycluster-complete-1 AppWrapper:
kubectl describe appwrapper raycluster-complete-1 -n default
The Status: stanza should show the State of Pending if the wrapped RayCluster has been queued. No Pods from the second AppWrapper are created, due to Insufficient resources to dispatch AppWrapper.
Status:
  Conditions:
    Last Transition Micro Time:  2023-08-29T17:39:08.406401Z
    Last Update Micro Time:      2023-08-29T17:39:08.406401Z
    Status:                      True
    Type:                        Init
    Last Transition Micro Time:  2023-08-29T17:39:08.406452Z
    Last Update Micro Time:      2023-08-29T17:39:08.406451Z
    Reason:                      AwaitingHeadOfLine
    Status:                      True
    Type:                        Queueing
    Last Transition Micro Time:  2023-08-29T17:39:08.423208Z
    Last Update Micro Time:      2023-08-29T17:39:08.423208Z
    Reason:                      FrontOfQueue.
    Status:                      True
    Type:                        HeadOfLine
    Last Transition Micro Time:  2023-08-29T17:39:08.439753Z
    Last Update Micro Time:      2023-08-29T17:39:08.439753Z
    Message:                     Insufficient resources to dispatch AppWrapper.
    Reason:                      AppWrapperNotRunnable.
    Status:                      True
    Type:                        Backoff
  Controllerfirsttimestamp:  2023-08-29T17:39:08.406399Z
  Filterignore:              true
  Queuejobstate:             Backoff
  Sender:                    before ScheduleNext - setHOL
  State:                     Pending
Events:                      <none>
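You can also confirm that no additional Pods were created for the queued AppWrapper; only the first RayCluster's head and worker Pods should be listed:
kubectl get pod -n default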
We may manually check the allocated resources:
kubectl describe node kind-control-plane
The Allocated resources section shows CPU Requests of 6050m (75%); the remaining CPU does not satisfy the second AppWrapper's request.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 6050m (75%) 5200m (65%)
memory 6824650Ki (84%) 6927050Ki (85%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
The out-of-the-box dispatching policy is FIFO, and it can be augmented to suit user needs. The second RayCluster will be dispatched when additional aggregated resources become available in the cluster or when the first AppWrapper is deleted.
For example, observe the other RayCluster being created after deleting the first AppWrapper:
kubectl delete appwrapper raycluster-complete -n default
Note: This will also remove any Kubernetes resources you may have wrapped as generictemplates within this AppWrapper.
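After the deletion, the second AppWrapper should be dispatched. Re-running the earlier checks should show its State as Running and its Pods being created:
kubectl describe appwrapper raycluster-complete-1 -n default
kubectl get pod -n default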