[MultiKueue] Document MPI Operator managedBy field (kubernetes-sigs#3316)

* Add MPIJob MultiCluster docs

* Include managedBy feature

* Update after code review
mszadkow authored and PBundyra committed Nov 5, 2024
1 parent 779b56a commit 26ce7c5
Showing 2 changed files with 43 additions and 0 deletions.
7 changes: 7 additions & 0 deletions site/content/en/docs/tasks/run/multikueue/kubeflow.md
@@ -31,3 +31,10 @@ See [Training Operator Installation](https://www.kubeflow.org/docs/components/tr
## MultiKueue integration

Once the setup is complete you can test it by running one of the Kubeflow Jobs e.g. PyTorchJob [`sample-pytorchjob.yaml`](/docs/tasks/run/kubeflow/pytorchjobs/#sample-pytorchjob).


## Working alongside MPI Operator
For MPI Operator and Training Operator to work on the same cluster:
1. Remove the `kubeflow.org_mpijobs.yaml` entry from `base/crds/kustomization.yaml` - https://github.com/kubeflow/training-operator/issues/1930
2. Modify the Training Operator deployment to enable all Kubeflow jobs except MPI - https://github.com/kubeflow/training-operator/issues/1777
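
The two steps above could look roughly like the following sketch. It is not taken from the Training Operator manifests verbatim: the list of remaining CRD files depends on the Training Operator release, and the Deployment name, namespace, and container name (`training-operator`, `kubeflow`) are assumptions. The `--enable-scheme` flag is the Training Operator option for enabling individual job kinds; check the flag and the supported scheme names against the release you deploy.

```yaml
# Step 1 (sketch): base/crds/kustomization.yaml with the MPIJob CRD entry removed.
# The exact set of remaining files varies by Training Operator release.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - kubeflow.org_tfjobs.yaml
  - kubeflow.org_pytorchjobs.yaml
  - kubeflow.org_xgboostjobs.yaml
  - kubeflow.org_paddlejobs.yaml
  # kubeflow.org_mpijobs.yaml is intentionally not listed
---
# Step 2 (sketch): Deployment fragment that enables every scheme except MPIJob.
# Deployment name, namespace, and container name are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: training-operator
          args:
            - "--enable-scheme=tfjob"
            - "--enable-scheme=pytorchjob"
            - "--enable-scheme=xgboostjob"
            - "--enable-scheme=paddlejob"
```

With the MPIJob CRD and scheme left out of the Training Operator, the MPI Operator is the only controller that owns MPIJob resources on the cluster.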

36 changes: 36 additions & 0 deletions site/content/en/docs/tasks/run/multikueue/mpijob.md
@@ -0,0 +1,36 @@
---
title: "Run MPI Jobs in Multi-Cluster"
linkTitle: "MPIJob"
weight: 3
date: 2024-10-25
description: >
Run a MultiKueue scheduled MPI Job.
---

## Before you begin

Check the [MultiKueue installation guide](/docs/tasks/manage/setup_multikueue) on how to properly set up MultiKueue clusters.

This setup requires Kueue v0.9.0 or later and MPI Operator v0.6.0 or later.

### Installation on the Clusters

{{% alert title="Note" color="primary" %}}
Note: While both MPI Operator and Training Operator must be running on the same cluster, there are special steps that have to be applied to the Training Operator deployment.
See [Working alongside MPI Operator](/docs/tasks/run/multikueue/kubeflow#working-alongside-mpi-operator) for more details.
{{% /alert %}}

See [MPI Operator Installation](https://www.kubeflow.org/docs/components/training/user-guides/mpi/#installation) for installation and configuration details of MPI Operator.

## MultiKueue integration

Once the setup is complete you can test it by running an MPI Job [`sample-mpijob.yaml`](/docs/tasks/run/kubeflow/mpijobs/#sample-mpijob).

{{% alert title="Note" color="primary" %}}
Note: Kueue defaults the `spec.runPolicy.managedBy` field to `kueue.x-k8s.io/multikueue` on the management cluster for MPIJob.

This allows the MPI Operator to ignore the Jobs managed by MultiKueue on the management cluster, and in particular skip Pod creation.

The Pods are created, and the actual computation happens, on the mirror copy of the Job on the selected worker cluster.
The mirror copy of the Job does not have the field set.
{{% /alert %}}
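
As an illustration of the defaulting described in the note, an MPIJob on the management cluster might look like the fragment below after Kueue has processed it. This is a sketch, not a complete manifest: the Job name and the LocalQueue name (`user-queue`) are hypothetical, and `mpiReplicaSpecs` is omitted.

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: sample-mpijob                       # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # assumed LocalQueue name
spec:
  runPolicy:
    # Defaulted by Kueue on the management cluster; the MPI Operator there
    # skips this Job and creates no Pods for it.
    managedBy: kueue.x-k8s.io/multikueue
  # mpiReplicaSpecs omitted for brevity
```

On the selected worker cluster, the mirror copy carries the same spec without `managedBy`, so the MPI Operator on that cluster reconciles it and creates the Pods.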
