- Andrey Velichkevich - @andreyvelich
- Yuki Iwai - @tenzen-y
Creation date: 2024-07-16
Google doc: https://bit.ly/3WzjTlw
This document discusses the new Kubeflow Training V2 API.
When we built the Kubeflow Training Operator a couple of years ago, Kubernetes lacked the features needed to support distributed machine learning (ML) training, such as SuccessPolicy and RestartPolicy (FailurePolicy). Recently, the Kubernetes community launched the Batch Working Group, which has actively worked on evolving the batch/v1 Job API and built a new Kubernetes SIGs project, JobSet, to manage groups of Jobs.
This document consolidates efforts for Cloud Native ML Training between the Kubeflow and Kubernetes communities.
We often implement features similar to batch/v1 Job, such as "suspend", on the Training Operator side, since the Training Operator creates blocks of plain Pods and Services for each rank once Kubeflow Jobs are created. However, if we continue taking the same approach of using the lowest-level abstractions that introduce redundancy, the maintenance costs will continue to increase. Replacing repetitive infrastructure layers with JobSet would help avoid redundancy and reduce developer toil.
Additionally, introducing JobSet as an infrastructure layer would allow us to easily introduce batch workload features such as PodFailurePolicy and PodDisruptionCondition.
Please also see the Kubernetes JobSet and Kubeflow Training Operator collaboration document.
In addition to the above motivation, we will address the following user feedback during the implementation:
- Confusion around Workers: #1790
- Support batch/v1 Job features: #1718
- ExitCodes for PodFailurePolicy: #1749
- Migrate to MPI V2 API: #1906
We can identify the following personas of the Training Operator:
- DevOps Engineer. They are familiar with Kubernetes concepts and they know how to manage the Kubernetes workloads. Usually, they are not experts in ML frameworks and ML algorithms.
- MLOps Engineer. They are familiar with ML frameworks and they know how to configure distributed PyTorch settings such as rendezvous backends or MPI configuration. Usually, they are not experts in Kubernetes and ML algorithms.
- Data Scientists. They create model architectures and advanced ML algorithms to train models. They prefer to use Python for their work. They are familiar with torch.nn APIs, but not with torch.distributed and Kubernetes concepts to scale model training.
Based on the above personas, we should build an API that everyone will benefit from.
- Introduce the TrainingRuntime and ClusterTrainingRuntime APIs that will store blueprints for model training and LLM fine-tuning using various ML frameworks. These runtimes will be built on top of the JobSet APIs with additional functionality for special use-cases, for example training using MPI orchestration.
- Introduce the Kubeflow TrainJob API that allows users to reuse these runtimes and quickly start a new training job without understanding complex Kubernetes APIs.
- Update the Kubeflow Training SDK to allow data scientists to quickly create and monitor TrainJobs.
- Create community-supported ClusterTrainingRuntimes for distributed training with PyTorch and MPI.
- Create community-supported ClusterTrainingRuntimes for LLM fine-tuning for various foundational models (e.g. Mistral, Llama-70b, Gemma-7b).
- Work on the following JobSet improvements: kubernetes-sigs/jobset#463 and kubernetes-sigs/jobset#572
- Support MPI V1 implementation.
- Distributed training for TensorFlow, XGBoost, JAX, and PaddlePaddle will be added after the initial implementation.
- Migrate the Kubeflow V1 controller to use JobSet.
We propose these APIs:
- TrainJob: A single API which allows data scientists to initiate a training or fine-tuning job from a pre-deployed training runtime. It allows users to tweak configurations for their training jobs such as model parameters, dataset parameters, or trainer configuration. The main goal is to hide unnecessary Kubernetes complexity from data scientists.
- TrainingRuntime and ClusterTrainingRuntime: A set of blueprints for how to start various types of training or fine-tuning jobs. Runtimes are managed by platform engineers and allow them to configure infrastructure parameters that are required for the TrainJob, for example failure policy or gang-scheduling.
The below diagram shows how platform engineers manage the TrainingRuntime and how data scientists create a TrainJob:
A TrainJob can be created using kubectl or the Kubeflow Python SDK.
The below diagram shows which resources will be created for LLM fine-tuning with PyTorch:
To better understand what Node and Worker mean in the diagram above, the following table explains the naming that each framework or technology uses:
| ML Framework or Technology | Definition of a Single Device (GPU) | Definition of a Single VM | Start Command | Reference Docs |
| --- | --- | --- | --- | --- |
| Kubernetes | Container Resource Unit | Pod's Container | Any | Resource units in K8s |
| PyTorch | Worker | Node | torchrun | PyTorch Elastic |
| MPI (OpenMPI) | Slot | Node | mpirun | Reference for OpenMPI |
| TensorFlow | Worker | Worker Pool | python | TensorFlow Distributed |
| Jax | Process jax.local_devices() | Host | python or mpirun | Jax Distributed |
| PaddlePaddle | Worker | Node | python -m paddle.distributed.launch | Paddle Distributed |
| XGBoost | Worker | Not Applicable | python | Rabit Tracker for c10d |
| DeepSpeed | Slot | Node | deepspeed | DeepSpeed Distributed |
Additionally, check this document for the mpirun command for other MPI implementations: Intel MPI, MPICH, Spectrum MPI.
The TrainJob exposes APIs that data scientists can override in the TrainingRuntime to create a training job:
type TrainJob struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// Spec defines the desired state of TrainJob.
Spec TrainJobSpec `json:"spec"`
// Status defines the current state of TrainJob.
Status TrainJobStatus `json:"status,omitempty"`
}
type TrainJobSpec struct {
// Reference to the Training Runtime.
TrainingRuntimeRef *TrainingRuntimeRef `json:"trainingRuntimeRef"`
// Parameters that data scientists can override
Trainer *Trainer `json:"trainer,omitempty"`
// Configuration for training dataset
DatasetConfig *DatasetConfig `json:"datasetConfig,omitempty"`
// Configuration for the pre-trained model and location for model output
ModelConfig *ModelConfig `json:"modelConfig,omitempty"`
// Custom metadata to apply for Job, JobSet, etc.
Labels map[string]string `json:"labels,omitempty"`
Annotations map[string]string `json:"annotations,omitempty"`
}
type TrainingRuntimeRef struct {
// Name for the training runtime.
Name string `json:"name"`
// Namespace for the runtime.
// If namespace is set, TrainingRuntime is used. Otherwise, ClusterTrainingRuntime is used.
Namespace string `json:"namespace,omitempty"`
}
type TrainJobStatus struct {
// Conditions for the TrainJob
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
This table explains the rationale for each TrainJob parameter:
| Parameter | What is it? |
| --- | --- |
| TrainingRuntimeRef | Reference to the existing TrainingRuntime that is pre-deployed by platform engineers |
| Trainer | Configuration for the Trainer such as image, number of nodes, and accelerators |
| ModelConfig | Configuration for the pre-trained model and location for the model output |
| DatasetConfig | Configuration for the dataset that will be used to train or fine-tune the model |
| Labels and Annotations | Custom metadata that needs to be applied to the TrainJob resources: JobSet, Job, and Pods |
| PodSpecOverrides | Custom overrides that are specific to the TrainJob and need to be applied to the TrainJob resources. For example, the user identity. Usually, it is managed by custom admission webhooks that inject data into the TrainJob after the user creates it via the Python SDK or kubectl |
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
name: torch-ddp
namespace: tenant-alpha
spec:
trainingRuntimeRef:
name: torch-distributed-multi-node
trainer:
image: docker.io/custom-training
command:
- torchrun train.py
numNodes: 5
resourcesPerNode:
requests:
nvidia.com/gpu: 2
The above command will be converted as follows:
torchrun --nnodes=5 --nproc-per-node=2 train.py
Additionally, the Kubeflow Training SDK allows users to create the above TrainJob using the Python API:
def train_func():
    import torch

    class Net(torch.nn.Module):
        """Create the PyTorch Model"""
        ...

    model = Net()

    # Attach model to the distributor
    torch.distributed.init_process_group(backend="nccl")
    model = torch.nn.parallel.DistributedDataParallel(model)

    # Train model
    model.train()


# Use Kubeflow SDK to create TrainJob.
from kubeflow.training import TrainingClient

TrainingClient().train(
    name="torch-ddp",
    func=train_func,
    num_nodes=5,
    resources_per_node={"gpu": 2},
)
This example shows how to create a TrainJob to fine-tune Llama 7b:
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
name: tune-llama-with-yelp
namespace: tenant-alpha
spec:
trainingRuntimeRef:
name: torch-tune-llama-7b
datasetConfig:
storageUri: s3://dataset/custom-dataset/yelp-review
parameters:
split: train[:5000]
modelConfig:
input:
storageUri: hf://yelp-review-full
output:
storageUri: s3://trained-model
The Trainer represents the APIs that data scientists can use to configure the trainer settings:
type Trainer struct {
// Docker image for the Trainer.
Image string `json:"image,omitempty"`
// Command for the training container.
// Validate that command contains torchrun or mpirun.
Command []string `json:"command,omitempty"`
// Args for the training container.
Args []string `json:"args,omitempty"`
// Env for the training container.
Env []corev1.EnvVar `json:"env,omitempty"`
// Number of training nodes.
NumNodes *int32 `json:"numNodes,omitempty"`
	// Resources for each node.
	ResourcesPerNode *corev1.ResourceRequirements `json:"resourcesPerNode,omitempty"`
// Number of processes in a single node.
// By default this value == number of GPUs in resources limits.
NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`
}
The following table explains how the TrainingRuntime parameters are overridden by the Trainer parameters.
| Trainer Parameter | TrainingRuntime Parameter |
| --- | --- |
| .image | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].image |
| .command | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].command |
| .args | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].args |
| .env | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].env |
| .numNodes | .spec.numNodes |
| .resourcesPerNode | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].resources |
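For illustration, applying the torch-ddp TrainJob shown above on top of a runtime whose replicated job is named node would roughly produce the following trainer container. This is a sketch of the merge result, not the exact controller output:
replicatedJobs:
  - name: node
    template:
      spec:
        template:
          spec:
            containers:
              - name: trainer
                # From .trainer.image
                image: docker.io/custom-training
                # From .trainer.command
                command:
                  - torchrun train.py
                # From .trainer.resourcesPerNode
                resources:
                  requests:
                    nvidia.com/gpu: 2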
The DatasetConfig represents the APIs that data scientists can use to configure the dataset location.
type DatasetConfig struct {
// Storage uri for the dataset provider.
StorageUri string `json:"storageUri"`
// Custom parameters for the dataset initializer.
	Parameters map[string]string `json:"parameters,omitempty"`
// Reference to the secrets to access dataset.
SecretRef corev1.SecretReference `json:"secretRef,omitempty"`
}
Initially we will support the following dataset providers:
- S3: storageUri: s3://bucket-name/path/dataset
- HuggingFace: storageUri: hf://repo-id
Parameters will be converted to the environment variables for the dataset-initializer container in the Initializer Job.
For example:
datasetConfig:
storageUri: s3://datasets/yelp-review
parameters:
endpointUrl: s3.custom.com
Will be converted to:
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: dataset-initializer
image: docker.io/kubeflow/dataset-initializer
env:
- name: STORAGE_URI
value: s3://dataset/yelp-review
- name: ENDPOINT_URL
value: s3.custom.com
The ModelConfig represents the APIs that data scientists can use to configure the pre-trained model input and output location.
type ModelConfig struct {
// Configuration for pre-trained model.
Input *InputModel `json:"input,omitempty"`
// Configuration for trained model.
Output *OutputModel `json:"output,omitempty"`
}
type InputModel struct {
// Storage uri for the model provider.
StorageUri string `json:"storageUri"`
// Custom parameters for the model initializer.
	Parameters map[string]string `json:"parameters,omitempty"`
// Reference to the secrets to access model.
SecretRef corev1.SecretReference `json:"secretRef,omitempty"`
}
type OutputModel struct {
// Storage uri for the model exported.
StorageUri string `json:"storageUri"`
// Custom parameters for the model exporter.
	Parameters map[string]string `json:"parameters,omitempty"`
// Reference to the secrets to export model.
SecretRef corev1.SecretReference `json:"secretRef,omitempty"`
}
Initially we will support the following model providers:
- HuggingFace: storageUri: hf://model-name
Parameters will be converted to the environment variables for the model-initializer container in the Initializer Job.
For example:
modelConfig:
storageUri: hf://bert-based-cased
parameters:
transformerType: AutoModelForCausalLM
Will be converted to:
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: model-initializer
image: docker.io/kubeflow/model-initializer
env:
- name: STORAGE_URI
value: hf://bert-based-cased
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
After the initial implementation of TrainJob and TrainingRuntime, we will support the ability to export the trained model. The following runtime can be implemented:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-tune-llama-7b-export
spec:
numNodes: 1
startupPolicy:
startupPolicyOrder: InOrder
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: dataset-initializer
image: docker.io/kubeflow/dataset-initializer
env:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- name: model-initializer
image: docker.io/kubeflow/model-initializer
env:
- name: STORAGE_URI
value: hf://meta-llama/Llama-2-7b
volumeMounts:
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/llm-trainer
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
- name: LORA_CONFIG
value: |
{"peft_type": "LORA", "r": 8, "lora_alpha": 16}
command:
- torchrun hf_llm_training.py
resources:
limits:
nvidia.com/gpu: 2
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- mountPath: /workspace/pre-trained-model
name: model-initializer
- mountPath: /workspace/adapters
name: model-exporter
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
- name: model-exporter
persistentVolumeClaim:
claimName: model-exporter
- name: Exporter
template:
spec:
template:
spec:
containers:
- name: model-exporter
image: docker.io/kubeflow/model-exporter
volumeMounts:
- mountPath: /workspace/adapters
name: model-exporter
volumes:
- name: model-exporter
persistentVolumeClaim:
claimName: model-exporter
The PodSpecOverrides represents overrides for the TrainingRuntime when a TrainJob is created.
These parameters can include the user's identity or PVCs.
Usually, these parameters should not be configured by the user and should be attached during the orchestration (e.g. using Kubernetes admission webhooks or custom clients).
In the future, we can add more parameters if we find use-cases where they are required.
type PodSpecOverride struct {
	// Names of the replicated jobs in the training runtime template to apply the overrides.
	TargetReplicatedJobs []string `json:"targetReplicatedJobs"`
	// Override parameters for Containers.
	Containers []Container `json:"containers,omitempty"`
	// Override parameters for InitContainers.
	InitContainers []Container `json:"initContainers,omitempty"`
	// Override parameters for Volumes.
	Volumes []corev1.Volume `json:"volumes,omitempty"`
// Custom Service Account
ServiceAccountName string `json:"serviceAccountName,omitempty"`
}
// Override for each container.
// Parameters from Trainer, DatasetConfig, and ModelConfig will take precedence.
type Container struct {
// Name for the container.
Name string `json:"name"`
// Command for the container.
	Command []string `json:"command,omitempty"`
// Args for the container.
Args []string `json:"args,omitempty"`
// Env for the container.
Env []corev1.EnvVar `json:"env,omitempty"`
	// EnvFrom for the container.
	EnvFrom []corev1.EnvFromSource `json:"envFrom,omitempty"`
	// Override parameters for volume mounts.
	VolumeMounts []corev1.VolumeMount `json:"volumeMounts,omitempty"`
}
This example shows how to override the user identity for the sidecar container and add a volume to the trainer container.
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
name: pytorch-distributed
namespace: tenant-alpha
spec:
trainingRuntimeRef:
name: pytorch-distributed-gpu
trainer:
image: docker.io/custom-training
podSpecOverrides:
- targetReplicatedJobs:
        - initializer
        - node
containers:
- name: user-identity
value: 123
- name: trainer
volumeMounts:
- name: user-123-volume
mountPath: /workspace
volumes:
- name: user-123-volume
persistentVolumeClaim:
claimName: user-123-volume
The TrainingRuntime is a set of pre-created configurations for model training on the cluster, represented as blueprints, for example Elastic PyTorch training, an MPI DeepSpeed configuration, or BERT LLM fine-tuning.
These blueprints can be deployed within the Training Operator control plane and stored in a Kubeflow public repository that users can apply to their clusters.
Platform or ML engineers can tweak existing blueprints, based on their requirements. For example, using custom configurations.
The Kubeflow Training Operator can maintain more Training Runtimes when the community is ready to support them. For example, runtimes for Jax or MLX. Initially, we will support: PyTorch, MPI, TensorFlow, XGBoost, and PaddlePaddle.
The TrainingRuntime is immutable; to make a change, a new version of the TrainingRuntime must be created and the TrainJob changed to point to the new version.
This provides control over how changes to runtimes propagate to existing training jobs, for example when training is running for a long time (e.g. 1-2 months).
In the future implementation, we will introduce a revision control mechanism similar to the Kubernetes Deployment to control versions of the TrainingRuntime and enable rolling updates.
We are going to create two CRDs: TrainingRuntime and ClusterTrainingRuntime. These runtimes have exactly the same APIs, but the first one is namespace-scoped and the second is cluster-scoped.
If trainingRuntimeRef in the TrainJob has a namespace, the controller will use the TrainingRuntime; otherwise, it will use the ClusterTrainingRuntime.
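For example, a minimal sketch of both cases (the team-a namespace is illustrative):
trainingRuntimeRef:
  name: torch-distributed-multi-node
  namespace: team-a   # namespace set: the TrainingRuntime in team-a is used

trainingRuntimeRef:
  name: torch-distributed-multi-node   # no namespace: the ClusterTrainingRuntime is used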
type TrainingRuntime struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// Framework specific parameters.
MLSpec *MLSpec `json:"mlSpec,omitempty"`
// Number of nodes to execute training.
NumNodes int `json:"numNodes,omitempty"`
	// JobSet spec.
	JobSetSpec *jobsetv1alpha2.JobSetSpec `json:",inline"`
// For gang-scheduling using volcano or scheduler plugins, supported for all frameworks.
GangScheduler *GangScheduler `json:"gangScheduler,omitempty"`
}
// One of the specs can be selected.
type MLSpec struct {
// Custom Spec for Torch
TorchSpec *TorchSpec `json:"torchSpec,omitempty"`
// Custom Spec for MPI
MPISpec *MPISpec `json:"mpiSpec,omitempty"`
}
The gang scheduler plugin is used to create the appropriate PodGroup for Volcano or scheduler-plugins.
type GangScheduler struct {
	// Plugin for gang scheduling.
	Plugin *GangSchedulerPlugin `json:"plugin,omitempty"`
	// Time threshold to schedule PodGroup for gang scheduling.
	ScheduleTimeoutSeconds string `json:"scheduleTimeoutSeconds,omitempty"`
}
type GangSchedulerPlugin string
const (
GangSchedulerPluginVolcano GangSchedulerPlugin = "volcano"
GangSchedulerPlugins GangSchedulerPlugin = "scheduler-plugins"
)
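A minimal sketch of a runtime that enables gang scheduling, assuming the field names from the GangScheduler API above (the runtime name is illustrative):
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-gang-scheduling
spec:
  gangScheduler:
    plugin: volcano
    scheduleTimeoutSeconds: "300"
  numNodes: 5
  replicatedJobs:
    - name: node
      template:
        spec:
          template:
            spec:
              containers:
                - name: trainer
                  image: docker.io/kubeflow/pytorch-mnist
                  command:
                    - torchrun train.py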
The TorchSpec API represents the configuration for PyTorch distributed training. This configuration allows platform engineers to explicitly configure the torchrun settings.
The distributed parameters are taken from the PyTorch distributed launch run.
For Elastic Training we will always pass the following parameters:
- --rdzv-backend=c10d
- --rdzv-id will be set automatically.
- --rdzv-endpoint will always point to the node-0 Pod.
Since etcd and etcd-v2 are legacy rendezvous backends, we won't support them in the TorchSpec. We can introduce them in the future if users require them.
// TorchSpec represents the configuration for PyTorch.
type TorchSpec struct {
// Number of Procs per Node.
NumProcPerNode int `json:"numProcPerNode,omitempty"`
// Used for single-node multi-worker training
Standalone bool `json:"standalone,omitempty"`
// Torch Elastic Policy.
ElasticPolicy *TorchElasticPolicy `json:"elasticPolicy,omitempty"`
}
// If the Elastic Policy is set, the numNodes parameter is ignored.
// --nnodes=minNodes:maxNodes
type TorchElasticPolicy struct {
	// The limits to restart TrainJob.
	// Insert it to the JobSet.spec.failurePolicy.maxRestarts.
	MaxRestarts *int32 `json:"maxRestarts,omitempty"`
	// Min number of nodes for HPA and torchrun.
	MinNodes *int32 `json:"minNodes,omitempty"`
	// Max number of nodes for HPA and torchrun.
	MaxNodes *int32 `json:"maxNodes,omitempty"`
// Metrics for scale up and down replicas.
Metrics []autoscalingv2.MetricSpec `json:"metrics,omitempty"`
}
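For example, for the torch-distributed-elastic runtime shown later with minNodes: 5 and maxNodes: 10, the generated command would look roughly as follows (the rendezvous endpoint is illustrative and points to the node-0 Pod):
torchrun --nnodes=5:10 --rdzv-backend=c10d --rdzv-id=<set automatically> --rdzv-endpoint=pytorch-node-0-0.pytorch:29400 train.py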
The MPISpec API represents the configuration for training using MPI orchestration, e.g. the creation of host-files and SSH keys. Using MPI might be more efficient for training on HPC clusters or for some ML frameworks (e.g. MLX distributed with MPI).
We will fully migrate to the MPI Operator V2 functionality as part of this KEP. Check the proposal for the MPI V2 APIs.
type MPISpec struct {
// Number of Procs per Node.
NumProcPerNode int `json:"numProcPerNode,omitempty"`
// MPI Implementation to create appropriate host-files.
// Can be one of OpenMPI, Intel, or MPICH.
MPIImplementation MPIImplementation `json:"mpiImplementation,omitempty"`
	// Directory where SSH keys are mounted.
	SSHAuthMountPath string `json:"sshAuthMountPath,omitempty"`
}
type MPIImplementation string
const (
MPIImplementationOpenMPI MPIImplementation = "OpenMPI"
MPIImplementationIntel MPIImplementation = "Intel"
MPIImplementationMPICH MPIImplementation = "MPICH"
)
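For illustration, with OpenMPI, numNodes: 2, and numProcPerNode: 5, the controller could generate a hostfile along these lines (a sketch; the hostnames are illustrative):
mpi-simple-node-0-0.mpi-simple slots=5
mpi-simple-node-0-1.mpi-simple slots=5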
The Kubeflow community is planning to support the following runtimes. Initially, we will maintain only the multi-node multi-worker runtime and PyTorch Elastic.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-distributed-multi-node
spec:
mlSpec:
torch:
numProcPerNode: 5
replicatedJobs:
- name: node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/pytorch-mnist
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
command:
- torchrun train.py
Example of usage:
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
name: torch-test
namespace: tenant-alpha
spec:
trainingRuntimeRef:
name: torch-distributed-multi-node
trainer:
resourcesPerNode:
requests:
nvidia.com/gpu: 1
args:
- num-epochs=5
Training runtime for PyTorch Elastic:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-distributed-elastic
spec:
mlSpec:
torchSpec:
elasticPolicy:
minNodes: 5
maxNodes: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 80
replicatedJobs:
- name: node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/pytorch-mnist
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
command:
- torchrun train.py
The following runtimes can be maintained in the future.
Single worker training:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-simple
spec:
replicatedJobs:
- name: node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/pytorch-mnist
command:
- torchrun train.py
Single node multi worker training:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-distributed-single-worker
spec:
mlSpec:
torch:
numProcPerNode: 5
standalone: True
replicatedJobs:
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/pytorch-mnist
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
command:
- torchrun train.py
In the future, we can consider using the torchtune CLI for fine-tuning with PyTorch.
The following runtime can be used for the Llama 7b model.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-tune-llama-7b
spec:
numNodes: 1
startupPolicy:
startupPolicyOrder: InOrder
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: dataset-initializer
image: docker.io/kubeflow/dataset-initializer
env:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- name: model-initializer
image: docker.io/kubeflow/model-initializer
env:
- name: STORAGE_URI
value: hf://meta-llama/Llama-2-7b
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
volumeMounts:
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/llm-trainer
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
- name: LORA_CONFIG
value: |
{"peft_type": "LORA", "r": 8, "lora_alpha": 16}
command:
- torchrun hf_llm_training.py
resources:
limits:
nvidia.com/gpu: 2
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
The following runtime can be used for Gemma fine-tuning.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-tune-gemma-7b
spec:
numNodes: 1
startupPolicy:
startupPolicyOrder: InOrder
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: dataset-initializer
image: docker.io/kubeflow/dataset-initializer
env:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- name: model-initializer
image: docker.io/kubeflow/model-initializer
env:
- name: STORAGE_URI
value: hf://google/gemma-7b
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
volumeMounts:
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/llm-trainer
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
- name: LORA_CONFIG
value: |
{"peft_type": "LORA", "r": 8, "lora_alpha": 16}
command:
- torchrun hf_llm_training.py
resources:
limits:
nvidia.com/gpu: 2
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
For MPI, we can add support for the DeepSpeed runtimes.
Example of a simple OpenMPI runtime:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: mpi-simple
spec:
mlSpec:
mpi:
mpiImplementation: OpenMPI
numProcPerNode: 5
numNodes: 5
replicatedJobs:
- name: Launcher
template:
spec:
template:
spec:
containers:
- name: mpi-launcher
image: docker.io/mpi-launch
command:
- mpirun -np 5 --host mpi-simple.default.svc
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/mpi-training
command:
- mpirun -np 2 train.py
Runtimes for TensorFlow, XGBoost, JAX, and PaddlePaddle will be added after the initial implementation for PyTorch.
These API changes will not be compatible with the Training Operator V1 APIs. Thus, existing users have to migrate to the newer APIs. The Kubeflow community will provide instructions on how to migrate existing training jobs to the new APIs.
The following example shows how to migrate from PyTorchJob to TrainingRuntime:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: pytorch-simple
namespace: kubeflow
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
imagePullPolicy: Always
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
Worker:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
imagePullPolicy: Always
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
apiVersion: kubeflow.org/v2alpha1
kind: TrainingRuntime
metadata:
name: torch-distributed-multi-node
spec:
numNodes: 2
replicatedJobs:
- name: node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
command:
- torchrun train.py
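A TrainJob that references the migrated runtime could then look as follows (a sketch; since the runtime above is namespace-scoped, the reference includes its namespace):
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  trainingRuntimeRef:
    name: torch-distributed-multi-node
    namespace: kubeflow
  trainer:
    args:
      - "--epochs=1"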