MLBatch

This repository describes the setup and use of the MLBatch queuing and quota management system on OpenShift and Kubernetes clusters. MLBatch leverages Kueue, the Kubeflow Training Operator, KubeRay, and the CodeFlare Operator from Red Hat OpenShift AI. MLBatch enables AppWrappers and adds Coscheduler. MLBatch includes a number of configuration steps to help these components work in harmony and support large workloads on large clusters.

MLBatch handles the queuing and dispatching of batch workloads on OpenShift and Kubernetes clusters. It enforces team quotas at the namespace level. It automates the borrowing and reclamation of unused quotas across teams. Teams can use priorities within their namespaces without impact on other teams. Using AppWrappers to submit workloads activates a number of fault detection and recovery capabilities, including automatically detecting failed pods and automatically retrying failed workloads. Coscheduler supports gang scheduling and minimizes fragmentation by preferentially packing jobs requiring less than a full node's worth of GPUs together.
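The borrowing and reclamation behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not how MLBatch or Kueue is implemented: the `Team` class, the function names, and the GPU counts are all hypothetical, quota is counted as a single number of GPUs, and reclamation is modeled per GPU rather than per workload.

```python
from dataclasses import dataclass


@dataclass
class Team:
    """Hypothetical team record; not an MLBatch or Kueue type."""
    name: str
    nominal_quota: int  # GPUs guaranteed to the team
    in_use: int = 0     # GPUs currently consumed by the team's workloads


def unused_capacity(teams):
    """GPUs idle across all teams: total nominal quota minus total usage."""
    return sum(t.nominal_quota for t in teams) - sum(t.in_use for t in teams)


def borrowed(team):
    """GPUs the team is using beyond its own nominal quota."""
    return max(0, team.in_use - team.nominal_quota)


def try_admit(team, request, teams):
    """Admit a request if enough idle quota exists, borrowing if needed."""
    if request <= unused_capacity(teams):
        team.in_use += request
        return True
    return False


def reclaim_and_admit(team, request, teams):
    """Take back lent quota so a team can run within its own nominal quota."""
    if team.in_use + request > team.nominal_quota:
        return False  # only requests within the team's own quota may reclaim
    shortfall = request - unused_capacity(teams)
    for other in teams:
        if shortfall <= 0:
            break
        if other is team:
            continue
        give_back = min(borrowed(other), shortfall)
        other.in_use -= give_back  # stands in for preempting borrowed work
        shortfall -= give_back
    return try_admit(team, request, teams)


# Two teams with 8 GPUs each (assumed numbers, for illustration only).
team_a = Team("team-a", nominal_quota=8)
team_b = Team("team-b", nominal_quota=8)
teams = [team_a, team_b]

admitted_b = try_admit(team_b, 12, teams)         # team-b borrows 4 idle GPUs
blocked_a = try_admit(team_a, 8, teams)           # only 4 GPUs are idle now
admitted_a = reclaim_and_admit(team_a, 8, teams)  # team-a takes its quota back
print(admitted_b, blocked_a, admitted_a, team_a.in_use, team_b.in_use)
```

In this model a team that stays within its nominal quota can always be admitted after reclamation, because every other team can be pushed back to at most its own quota. A real cluster preempts whole workloads rather than individual GPUs, so the amounts freed are coarser than shown here.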

Cluster Setup

To learn how to set up MLBatch on a cluster and onboard teams, see SETUP.md.

Quota maintenance is a key aspect of smoothly administering an MLBatch cluster. Cluster admins should carefully read QUOTA_MAINTENANCE.md.

Running Workloads

To learn how to run workloads on an MLBatch cluster, see USAGE.md. If you are already familiar with the CodeFlare stack for managing AI/ML workloads on Kubernetes, see CODEFLARE.md instead.

PyTorchJobs via the MLBatch Helm Chart

Properly configuring a distributed PyTorchJob to make effective use of the MLBatch system and hardware accelerators (GPUs, RoCE GDR) can be tedious. To automate this process, we provide a Helm chart that captures best practices and common configuration options. Using this Helm chart helps eliminate common mistakes. Please see pytorchjob-generator for detailed usage instructions.

Development Setup

If you will be contributing to the development of the MLBatch project, you must set up pre-commit hooks for your local clone of the repository. Do the following once, immediately after cloning this repo:

helm plugin install https://github.com/helm-unittest/helm-unittest.git
pre-commit install

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

