- Andrey Velichkevich - @andreyvelich
- Yuki Iwai - @tenzen-y
Creation date: 2024-07-16
Google doc: https://bit.ly/3WzjTlw
This document discusses the new Kubeflow Training V2 API.
When we built the Kubeflow Training Operator a couple of years ago, Kubernetes lacked the features needed to support distributed machine learning (ML) training, such as SuccessPolicy and RestartPolicy (FailurePolicy). Recently, the Kubernetes community launched the Batch Working Group, which has actively worked on evolving the batch/v1 Job API and built a new Kubernetes SIGs project, JobSet, to manage groups of Jobs.
This document consolidates efforts for Cloud Native ML Training between the Kubeflow and Kubernetes communities.
We often implement features similar to batch/v1 Job, such as "suspend", on the Training Operator side, since the Training Operator creates blocks of plain Pods and Services for each rank once Kubeflow Jobs are created. However, if we continue taking the same approach of using the lowest-level abstractions that introduce redundancy, the maintenance costs will continue to increase. Replacing repetitive infrastructure layers with JobSet would help avoid redundancy and reduce developer toil.
Additionally, introducing JobSet as an infrastructure layer would allow us to easily introduce batch workload features such as PodFailurePolicy and PodDisruptionCondition.
Please also see the Kubernetes JobSet and Kubeflow Training Operator collaboration document.
In addition to the above motivation, we will address the following user feedback during the implementation:
- Confusion around Workers: #1790
- Support batch/v1 Job features: #1718
- ExitCodes for PodFailurePolicy: #1749
- Migrate to MPI V2 API: #1906
We can identify the following personas of the Training Operator:
- DevOps Engineer. They are familiar with Kubernetes concepts and they know how to manage the Kubernetes workloads. Usually, they are not experts in ML frameworks and ML algorithms.
- MLOps Engineer. They are familiar with ML frameworks and they know how to configure distributed PyTorch settings such as rendezvous backends or MPI configuration. Usually, they are not experts in Kubernetes and ML algorithms.
- Data Scientists. They create model architectures and advanced ML algorithms to train models. They prefer to use Python for their work. They are familiar with torch.nn APIs, but not with torch.distributed and Kubernetes concepts to scale model training.
Based on the above personas, we should build an API that everyone will benefit from.
- Introduce the TrainingRuntime and ClusterTrainingRuntime APIs that will store blueprints for model training and LLM fine-tuning using various ML frameworks. These runtimes will be built on top of the JobSet APIs with additional functionality for special use-cases, for example training using MPI orchestration.
- Introduce the Kubeflow TrainJob API that allows users to reuse these runtimes and quickly start a new training job without understanding complex Kubernetes APIs.
- Update the Kubeflow Training SDK to allow data scientists to quickly create and monitor TrainJobs.
- Create community-supported ClusterTrainingRuntimes for distributed training with PyTorch and MPI.
- Create community-supported ClusterTrainingRuntimes for LLM fine-tuning for various foundational models (e.g. Mistral, Llama-70b, Gemma-7b).
- Work on the following JobSet improvements: kubernetes-sigs/jobset#463 and kubernetes-sigs/jobset#572
- Support MPI V1 implementation.
- Distributed training for TensorFlow, XGBoost, JAX, and PaddlePaddle will be added after the initial implementation.
- Migrate the Kubeflow V1 controller to use JobSet.
We propose these APIs:
- TrainJob: A single API which allows data scientists to initiate a training or fine-tuning job from a pre-deployed training runtime. It allows users to tweak configurations for their training jobs such as model parameters, dataset parameters, or trainer configuration. The main goal is to hide unnecessary Kubernetes complexity from data scientists.
- TrainingRuntime and ClusterTrainingRuntime: A set of blueprints for how to start various types of training or fine-tuning jobs. Runtimes are managed by platform engineers and allow them to configure infrastructure parameters that are required for the TrainJob, for example failure policy or gang-scheduling.
The below diagram shows how platform engineers manage the TrainingRuntime and how data scientists create a TrainJob:
A TrainJob can be created using kubectl or the Kubeflow Python SDK.
The below diagram shows which resources will be created for LLM fine-tuning with PyTorch:
To better understand what Node and Worker mean in the diagram above, the following table explains the naming that each framework or technology uses:
| ML Framework or Technology | Definition of a Single Device (GPU) | Definition of a Single VM | Start Command | Reference Docs |
| --- | --- | --- | --- | --- |
| Kubernetes | Container Resource Unit | Pod's Container | Any | Resource units in K8s |
| PyTorch | Worker | Node | torchrun | PyTorch Elastic |
| MPI (OpenMPI) | Slot | Node | mpirun | Reference for OpenMPI |
| TensorFlow | Worker | Worker Pool | python | TensorFlow Distributed |
| Jax | Process jax.local_devices() | Host | python or mpirun | Jax Distributed |
| PaddlePaddle | Worker | Node | python -m paddle.distributed.launch | Paddle Distributed |
| XGBoost | Worker | Not Applicable | python | Rabit Tracker for c10d |
| DeepSpeed | Slot | Node | deepspeed | DeepSpeed Distributed |
Additionally, check this document for the mpirun command for other MPI implementations: Intel MPI, MPICH, Spectrum MPI.
The TrainJob exposes APIs that data scientists can override in the TrainingRuntime to create a training job:
type TrainJob struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// Spec defines the desired state of TrainJob.
Spec TrainJobSpec `json:"spec"`
// Status defines the current state of TrainJob.
Status TrainJobStatus `json:"status,omitempty"`
}
type TrainJobSpec struct {
// Reference to the Training Runtime.
TrainingRuntimeRef *TrainingRuntimeRef `json:"trainingRuntimeRef"`
// Parameters that data scientists can override
Trainer *Trainer `json:"trainer,omitempty"`
// Configuration for training dataset
DatasetConfig *DatasetConfig `json:"datasetConfig,omitempty"`
// Configuration for the pre-trained model and location for model output
ModelConfig *ModelConfig `json:"modelConfig,omitempty"`
// Custom metadata to apply for Job, JobSet, etc.
Labels map[string]string `json:"labels,omitempty"`
Annotations map[string]string `json:"annotations,omitempty"`
}
type TrainingRuntimeRef struct {
// Name for the training runtime.
Name string `json:"name"`
// Namespace for the runtime.
// If namespace is set, TrainingRuntime is used. Otherwise, ClusterTrainingRuntime is used.
Namespace string `json:"namespace,omitempty"`
}
type TrainJobStatus struct {
// Conditions for the TrainJob
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
This table explains the rationale for each TrainJob parameter:
| Parameter | What is it? |
| --- | --- |
| TrainingRuntimeRef | Reference to the existing TrainingRuntime that is pre-deployed by platform engineers |
| Trainer | Configuration for the Trainer such as image, number of nodes, and accelerators |
| ModelConfig | Configuration for the pre-trained model and location for the model output |
| DatasetConfig | Configuration for the dataset that will be used to train or fine-tune the model |
| Labels and Annotations | Custom metadata that needs to be applied to the TrainJob resources: JobSet, Job, and Pods |
| PodSpecOverrides | Custom overrides that are specific to the TrainJob and need to be applied to the TrainJob resources. For example, the user identity. Usually, it is managed by custom admission webhooks that inject data into the TrainJob after the user creates it via the Python SDK or kubectl |
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
name: torch-ddp
namespace: tenant-alpha
spec:
trainingRuntimeRef:
name: torch-distributed-multi-node
trainer:
image: docker.io/custom-training
command:
- torchrun train.py
numNodes: 5
resourcesPerNode:
requests:
nvidia.com/gpu: 2
The above command will be converted as follows:
torchrun --nnodes=5 --nproc-per-node=2 train.py
Additionally, the Kubeflow Training SDK allows users to create the above TrainJob using the Python API:
def train_func():
    import torch

    class Net(torch.nn.Module):
        """Create the PyTorch Model"""
        ...

    model = Net()

    # Attach model to the distributor
    torch.distributed.init_process_group(backend="nccl")
    model = torch.nn.parallel.DistributedDataParallel(model)

    # Train model
    model.train()


# Use Kubeflow SDK to create TrainJob.
from kubeflow.training import TrainingClient

TrainingClient().train(
    name="torch-ddp",
    func=train_func,
    num_nodes=5,
    resources_per_node={"gpu": 2},
)
This example shows how to create a TrainJob to fine-tune Llama 7b:
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
name: tune-llama-with-yelp
namespace: tenant-alpha
spec:
trainingRuntimeRef:
name: torch-tune-llama-7b
datasetConfig:
storageUri: s3://dataset/custom-dataset/yelp-review
parameters:
split: train[:5000]
modelConfig:
input:
storageUri: hf://yelp-review-full
output:
storageUri: s3://trained-model
The Trainer represents the APIs that data scientists can use to configure the trainer settings:
type Trainer struct {
// Docker image for the Trainer.
Image string `json:"image,omitempty"`
// Command for the training container.
// Validate that command contains torchrun or mpirun.
Command []string `json:"command,omitempty"`
// Args for the training container.
Args []string `json:"args,omitempty"`
// Env for the training container.
Env []corev1.EnvVar `json:"env,omitempty"`
// Number of training nodes.
NumNodes *int32 `json:"numNodes,omitempty"`
	// Resources for each node.
	ResourcesPerNode *corev1.ResourceRequirements `json:"resourcesPerNode,omitempty"`
// Number of processes in a single node.
// By default this value == number of GPUs in resources limits.
NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`
}
The following table explains how the TrainingRuntime parameters are overridden by the Trainer parameters.
| Trainer Parameter | TrainingRuntime Parameter |
| --- | --- |
| .image | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].image |
| .command | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].command |
| .args | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].args |
| .env | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].env |
| .numNodes | .spec.numNodes |
| .resourcesPerNode | .spec.replicatedJobs[name='Node'].template.spec.template.spec.containers[name='trainer'].resources |
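For illustration, applying the torch-ddp TrainJob shown above on top of a runtime whose replicated job is named node would roughly produce the following trainer container. This is a sketch of the merge result, not the exact controller output:
replicatedJobs:
  - name: node
    template:
      spec:
        template:
          spec:
            containers:
              - name: trainer
                # From .trainer.image
                image: docker.io/custom-training
                # From .trainer.command
                command:
                  - torchrun train.py
                # From .trainer.resourcesPerNode
                resources:
                  requests:
                    nvidia.com/gpu: 2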
The DatasetConfig represents the APIs that data scientists can use to configure the dataset location.
type DatasetConfig struct {
// Storage uri for the dataset provider.
StorageUri string `json:"storageUri"`
// Custom parameters for the dataset initializer.
	Parameters map[string]string `json:"parameters,omitempty"`
// Reference to the secrets to access dataset.
SecretRef corev1.SecretReference `json:"secretRef,omitempty"`
}
Initially we will support the following dataset providers:
- S3: storageUri: s3://bucket-name/path/dataset
- HuggingFace: storageUri: hf://repo-id
Parameters will be converted to the environment variables for the dataset-initializer container in the Initializer Job.
For example:
datasetConfig:
storageUri: s3://datasets/yelp-review
parameters:
endpointUrl: s3.custom.com
Will be converted to:
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: dataset-initializer
image: docker.io/kubeflow/dataset-initializer
env:
- name: STORAGE_URI
value: s3://dataset/yelp-review
- name: ENDPOINT_URL
value: s3.custom.com
The ModelConfig represents the APIs that data scientists can use to configure the pre-trained model input and output location.
type ModelConfig struct {
// Configuration for pre-trained model.
Input *InputModel `json:"input,omitempty"`
// Configuration for trained model.
Output *OutputModel `json:"output,omitempty"`
}
type InputModel struct {
// Storage uri for the model provider.
StorageUri string `json:"storageUri"`
// Custom parameters for the model initializer.
	Parameters map[string]string `json:"parameters,omitempty"`
// Reference to the secrets to access model.
SecretRef corev1.SecretReference `json:"secretRef,omitempty"`
}
type OutputModel struct {
// Storage uri for the model exported.
StorageUri string `json:"storageUri"`
// Custom parameters for the model exporter.
	Parameters map[string]string `json:"parameters,omitempty"`
// Reference to the secrets to export model.
SecretRef corev1.SecretReference `json:"secretRef,omitempty"`
}
Initially we will support the following model providers:
- HuggingFace: storageUri: hf://model-name
Parameters will be converted to the environment variables for the model-initializer container in the Initializer Job.
For example:
modelConfig:
storageUri: hf://bert-based-cased
parameters:
transformerType: AutoModelForCausalLM
Will be converted to:
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: model-initializer
image: docker.io/kubeflow/model-initializer
env:
- name: STORAGE_URI
value: hf://bert-based-cased
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
After the initial implementation of TrainJob and TrainingRuntime, we will support the ability to export the trained model. The following runtime can be implemented:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-tune-llama-7b-export
spec:
numNodes: 1
startupPolicy:
startupPolicyOrder: InOrder
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: dataset-initializer
image: docker.io/kubeflow/dataset-initializer
env:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- name: model-initializer
image: docker.io/kubeflow/model-initializer
env:
- name: STORAGE_URI
value: hf://meta-llama/Llama-2-7b
volumeMounts:
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/llm-trainer
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
- name: LORA_CONFIG
value: |
{"peft_type": "LORA", "r": 8, "lora_alpha": 16}
command:
- torchrun hf_llm_training.py
resources:
limits:
nvidia.com/gpu: 2
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- mountPath: /workspace/pre-trained-model
name: model-initializer
- mountPath: /workspace/adapters
name: model-exporter
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
- name: model-exporter
persistentVolumeClaim:
claimName: model-exporter
- name: Exporter
template:
spec:
template:
spec:
containers:
- name: model-exporter
image: docker.io/kubeflow/model-exporter
volumeMounts:
- mountPath: /workspace/adapters
name: model-exporter
volumes:
- name: model-exporter
persistentVolumeClaim:
claimName: model-exporter
The PodSpecOverrides represents overrides for the TrainingRuntime when a TrainJob is created.
These parameters can include the user's identity or PVCs.
Usually, these parameters should not be configured by the user and should be attached during the orchestration (e.g. using Kubernetes admission webhooks or custom clients).
In the future, we can add more parameters if we find use-cases where they are required.
type PodSpecOverride struct {
	// Names of the replicated jobs in the training runtime template to apply the overrides.
	TargetReplicatedJobs []string `json:"targetReplicatedJobs"`
	// Override parameters for Containers.
	Containers []Container `json:"containers,omitempty"`
	// Override parameters for InitContainers.
	InitContainers []Container `json:"initContainers,omitempty"`
	// Override parameters for Volumes.
	Volumes []corev1.Volume `json:"volumes,omitempty"`
// Custom Service Account
ServiceAccountName string `json:"serviceAccountName,omitempty"`
}
// Override for each container.
// Parameters from Trainer, DatasetConfig, and ModelConfig will take precedence.
type Container struct {
// Name for the container.
Name string `json:"name"`
// Command for the container.
	Command []string `json:"command,omitempty"`
// Args for the container.
Args []string `json:"args,omitempty"`
// Env for the container.
Env []corev1.EnvVar `json:"env,omitempty"`
	// EnvFrom for the container.
	EnvFrom []corev1.EnvFromSource `json:"envFrom,omitempty"`
	// Override parameters for volume mounts.
	VolumeMounts []corev1.VolumeMount `json:"volumeMounts,omitempty"`
}
This example shows how to override the user identity for the sidecar container and add a volume to the trainer container.
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
name: pytorch-distributed
namespace: tenant-alpha
spec:
trainingRuntimeRef:
name: pytorch-distributed-gpu
trainer:
image: docker.io/custom-training
podSpecOverrides:
- targetReplicatedJobs:
        - initializer
        - node
containers:
- name: user-identity
value: 123
- name: trainer
volumeMounts:
- name: user-123-volume
mountPath: /workspace
volumes:
- name: user-123-volume
persistentVolumeClaim:
claimName: user-123-volume
The TrainingRuntime is a set of pre-created configurations for model training on the cluster, represented as blueprints, for example Elastic PyTorch training, an MPI DeepSpeed configuration, or BERT LLM fine-tuning.
These blueprints can be deployed within the Training Operator control plane and stored in a Kubeflow public repository that users can apply to their clusters.
Platform or ML engineers can tweak existing blueprints, based on their requirements. For example, using custom configurations.
The Kubeflow Training Operator can maintain more Training Runtimes when the community is ready to support them. For example, runtimes for Jax or MLX. Initially, we will support: PyTorch, MPI, TensorFlow, XGBoost, and PaddlePaddle.
The TrainingRuntime is immutable; to make a change, a new version of the TrainingRuntime must be created and the TrainJob changed to point to the new version.
This provides control over how changes to runtimes propagate to existing training jobs, for example when training is running for a long time (e.g. 1-2 months).
In the future implementation, we will introduce a revision control mechanism similar to the Kubernetes Deployment to control versions of the TrainingRuntime and enable rolling updates.
We are going to create two CRDs: TrainingRuntime and ClusterTrainingRuntime. These runtimes have exactly the same APIs, but the first one is namespace-scoped and the second is cluster-scoped.
If trainingRuntimeRef in the TrainJob has a namespace, the controller will use the TrainingRuntime; otherwise, it will use the ClusterTrainingRuntime.
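For example, a minimal sketch of both cases (the team-a namespace is illustrative):
trainingRuntimeRef:
  name: torch-distributed-multi-node
  namespace: team-a   # namespace set: the TrainingRuntime in team-a is used

trainingRuntimeRef:
  name: torch-distributed-multi-node   # no namespace: the ClusterTrainingRuntime is used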
type TrainingRuntime struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// Framework specific parameters.
MLSpec *MLSpec `json:"mlSpec,omitempty"`
// Number of nodes to execute training.
NumNodes int `json:"numNodes,omitempty"`
	// JobSet spec.
	JobSetSpec *jobsetv1alpha2.JobSetSpec `json:",inline"`
// For gang-scheduling using volcano or scheduler plugins, supported for all frameworks.
GangScheduler *GangScheduler `json:"gangScheduler,omitempty"`
}
// One of the specs can be selected.
type MLSpec struct {
// Custom Spec for Torch
TorchSpec *TorchSpec `json:"torchSpec,omitempty"`
// Custom Spec for MPI
MPISpec *MPISpec `json:"mpiSpec,omitempty"`
}
The gang scheduler plugin is used to create the appropriate PodGroup for Volcano or scheduler-plugins.
type GangScheduler struct {
	// Plugin for gang scheduling.
	Plugin *GangSchedulerPlugin `json:"plugin,omitempty"`
	// Time threshold to schedule PodGroup for gang scheduling.
	ScheduleTimeoutSeconds string `json:"scheduleTimeoutSeconds,omitempty"`
}
type GangSchedulerPlugin string
const (
GangSchedulerPluginVolcano GangSchedulerPlugin = "volcano"
GangSchedulerPlugins GangSchedulerPlugin = "scheduler-plugins"
)
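A minimal sketch of a runtime that enables gang scheduling, assuming the field names from the GangScheduler API above (the runtime name is illustrative):
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-gang-scheduling
spec:
  gangScheduler:
    plugin: volcano
    scheduleTimeoutSeconds: "300"
  numNodes: 5
  replicatedJobs:
    - name: node
      template:
        spec:
          template:
            spec:
              containers:
                - name: trainer
                  image: docker.io/kubeflow/pytorch-mnist
                  command:
                    - torchrun train.py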
The TorchSpec API represents the configuration for PyTorch distributed training. This configuration allows platform engineers to explicitly configure the torchrun settings.
The distributed parameters are taken from the PyTorch distributed launch run.
For Elastic Training we will always pass the following parameters:
- --rdzv-backend=c10d
- --rdzv-id will be set automatically.
- --rdzv-endpoint will always point to the node-0 Pod.
Since etcd and etcd-v2 are legacy rendezvous backends, we won't support them in the TorchSpec. We can introduce them in the future if users require them.
// TorchSpec represents the configuration for PyTorch.
type TorchSpec struct {
// Number of Procs per Node.
NumProcPerNode int `json:"numProcPerNode,omitempty"`
// Used for single-node multi-worker training
Standalone bool `json:"standalone,omitempty"`
// Torch Elastic Policy.
ElasticPolicy *TorchElasticPolicy `json:"elasticPolicy,omitempty"`
}
// If the Elastic Policy is set, the numNodes parameter is ignored.
// --nnodes=minNodes:maxNodes
type TorchElasticPolicy struct {
	// The limits to restart TrainJob.
	// Insert it to the JobSet.spec.failurePolicy.maxRestarts.
	MaxRestarts *int32 `json:"maxRestarts,omitempty"`
	// Min number of nodes for HPA and torchrun.
	MinNodes *int32 `json:"minNodes,omitempty"`
	// Max number of nodes for HPA and torchrun.
	MaxNodes *int32 `json:"maxNodes,omitempty"`
// Metrics for scale up and down replicas.
Metrics []autoscalingv2.MetricSpec `json:"metrics,omitempty"`
}
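For example, for the torch-distributed-elastic runtime shown later with minNodes: 5 and maxNodes: 10, the generated command would look roughly as follows (the rendezvous endpoint is illustrative and points to the node-0 Pod):
torchrun --nnodes=5:10 --rdzv-backend=c10d --rdzv-id=<set automatically> --rdzv-endpoint=pytorch-node-0-0.pytorch:29400 train.py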
The MPISpec API represents the configuration for training using MPI orchestration, e.g. the creation of host-files and SSH keys. Using MPI might be more efficient for training on HPC clusters or for some ML frameworks (e.g. MLX distributed with MPI).
We will fully migrate to the MPI Operator V2 functionality as part of this KEP. Check the proposal for the MPI V2 APIs.
type MPISpec struct {
// Number of Procs per Node.
NumProcPerNode int `json:"numProcPerNode,omitempty"`
// MPI Implementation to create appropriate host-files.
// Can be one of OpenMPI, Intel, or MPICH.
MPIImplementation MPIImplementation `json:"mpiImplementation,omitempty"`
	// Directory where SSH keys are mounted.
	SSHAuthMountPath string `json:"sshAuthMountPath,omitempty"`
}
type MPIImplementation string
const (
MPIImplementationOpenMPI MPIImplementation = "OpenMPI"
MPIImplementationIntel MPIImplementation = "Intel"
MPIImplementationMPICH MPIImplementation = "MPICH"
)
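For illustration, with OpenMPI, numNodes: 2, and numProcPerNode: 5, the controller could generate a hostfile along these lines (a sketch; the hostnames are illustrative):
mpi-simple-node-0-0.mpi-simple slots=5
mpi-simple-node-0-1.mpi-simple slots=5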
The Kubeflow community is planning to support the following runtimes. Initially, we will maintain only the multi-node multi-worker runtime and PyTorch Elastic.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-distributed-multi-node
spec:
mlSpec:
torch:
numProcPerNode: 5
replicatedJobs:
- name: node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/pytorch-mnist
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
command:
- torchrun train.py
Example of usage:
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
name: torch-test
namespace: tenant-alpha
spec:
trainingRuntimeRef:
name: torch-distributed-multi-node
trainer:
resourcesPerNode:
requests:
nvidia.com/gpu: 1
args:
- num-epochs=5
Training runtime for PyTorch Elastic:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-distributed-elastic
spec:
mlSpec:
torchSpec:
elasticPolicy:
minNodes: 5
maxNodes: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 80
replicatedJobs:
- name: node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/pytorch-mnist
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
command:
- torchrun train.py
The following runtimes can be maintained in the future.
Single worker training:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-simple
spec:
replicatedJobs:
- name: node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/pytorch-mnist
command:
- torchrun train.py
Single node multi worker training:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-distributed-single-worker
spec:
mlSpec:
torch:
numProcPerNode: 5
standalone: True
replicatedJobs:
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/pytorch-mnist
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
command:
- torchrun train.py
In the future, we can consider using the torchtune CLI for fine-tuning with PyTorch.
The following runtime can be used for the Llama 7b model.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-tune-llama-7b
spec:
numNodes: 1
startupPolicy:
startupPolicyOrder: InOrder
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: dataset-initializer
image: docker.io/kubeflow/dataset-initializer
env:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- name: model-initializer
image: docker.io/kubeflow/model-initializer
env:
- name: STORAGE_URI
value: hf://meta-llama/Llama-2-7b
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
volumeMounts:
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/llm-trainer
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
- name: LORA_CONFIG
value: |
{"peft_type": "LORA", "r": 8, "lora_alpha": 16}
command:
- torchrun hf_llm_training.py
resources:
limits:
nvidia.com/gpu: 2
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
The following runtime can be used for Gemma fine-tuning.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: torch-tune-gemma-7b
spec:
numNodes: 1
startupPolicy:
startupPolicyOrder: InOrder
replicatedJobs:
- name: Initializer
template:
spec:
template:
spec:
containers:
- name: dataset-initializer
image: docker.io/kubeflow/dataset-initializer
env:
- name: STORAGE_URI
value: hf://tatsu-lab/alpaca
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- name: model-initializer
image: docker.io/kubeflow/model-initializer
env:
- name: STORAGE_URI
value: hf://google/gemma-7b
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
volumeMounts:
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflow/llm-trainer
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
- name: TRANSFORMER_TYPE
value: AutoModelForCausalLM
- name: LORA_CONFIG
value: |
{"peft_type": "LORA", "r": 8, "lora_alpha": 16}
command:
- torchrun hf_llm_training.py
resources:
limits:
nvidia.com/gpu: 2
volumeMounts:
- mountPath: /workspace/dataset
name: dataset-initializer
- mountPath: /workspace/model
name: model-initializer
volumes:
- name: dataset-initializer
persistentVolumeClaim:
claimName: dataset-initializer
- name: model-initializer
persistentVolumeClaim:
claimName: model-initializer
For MPI, we can add support for the DeepSpeed runtimes.
Example of a simple OpenMPI runtime:
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
name: mpi-simple
spec:
mlSpec:
mpi:
mpiImplementation: OpenMPI
numProcPerNode: 5
numNodes: 5
replicatedJobs:
- name: Launcher
template:
spec:
template:
spec:
containers:
- name: mpi-launcher
image: docker.io/mpi-launch
command:
- mpirun -np 5 --host mpi-simple.default.svc
- name: Node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/mpi-training
command:
- mpirun -np 2 train.py
Runtimes for TensorFlow, XGBoost, JAX, and PaddlePaddle will be added after the initial implementation for PyTorch.
These API changes will not be compatible with the Training Operator V1 APIs. Thus, existing users have to migrate to the newer APIs. The Kubeflow community will provide instructions on how to migrate existing training jobs to the new APIs.
The following example shows how to migrate from PyTorchJob to TrainingRuntime:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
name: pytorch-simple
namespace: kubeflow
spec:
pytorchReplicaSpecs:
Master:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
imagePullPolicy: Always
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
Worker:
replicas: 1
restartPolicy: OnFailure
template:
spec:
containers:
- name: pytorch
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
imagePullPolicy: Always
command:
- "python3"
- "/opt/pytorch-mnist/mnist.py"
- "--epochs=1"
apiVersion: kubeflow.org/v2alpha1
kind: TrainingRuntime
metadata:
name: torch-distributed-multi-node
spec:
numNodes: 2
replicatedJobs:
- name: node
template:
spec:
template:
spec:
containers:
- name: trainer
image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
env:
- name: MASTER_ADDR
value: "pytorch-node-0-0.pytorch"
- name: MASTER_PORT
value: 29400
command:
- torchrun train.py
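A TrainJob that references the migrated runtime could then look as follows (a sketch; since the runtime above is namespace-scoped, the reference includes its namespace):
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  trainingRuntimeRef:
    name: torch-distributed-multi-node
    namespace: kubeflow
  trainer:
    args:
      - "--epochs=1"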