Training: Add Fine-Tune API Docs (#3718)
* Add Train API for LLMs Architecture
* Training: Add Fine-Tune API Docs
* Add content for Train API
* Add PVC notice
* Fix BERT example
* Add reference architecture
* Add Explanation for Fine-Tuning API

Signed-off-by: Andrey Velichkevich <[email protected]>
1 parent 57e57eb, commit 36544ae

Showing 6 changed files with 227 additions and 1 deletion.

@@ -0,0 +1,5 @@
+++
title = "Explanation"
description = "Explanation for Training Operator Features"
weight = 60
+++

content/en/docs/components/training/explanation/fine-tuning.md (63 changes: 63 additions & 0 deletions)

@@ -0,0 +1,63 @@
+++
title = "LLM Fine-Tuning with Training Operator"
description = "Why Training Operator needs fine-tuning API"
weight = 10
+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
share your experience using the [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
or [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
{{% /alert %}}

This page explains how the [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
fits into the Kubeflow ecosystem.

In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI),
the ability to fine-tune pre-trained models represents a significant leap towards achieving custom
solutions with less effort and time. Fine-tuning allows practitioners to adapt large language models
(LLMs) like BERT or GPT to their specific needs by training these models on custom datasets.
This process maintains the model's architecture and learned parameters while making it more relevant
to particular applications. Whether you're working in natural language processing (NLP),
image classification, or another ML domain, fine-tuning can drastically improve the performance and
applicability of pre-existing models to new datasets and problems.

## Why Does the Training Operator Fine-Tune API Matter?

The introduction of the Fine-Tune API in the Training Operator Python SDK is a game-changer for
ML practitioners operating within the Kubernetes ecosystem. Historically, Training Operator has
streamlined the orchestration of ML workloads on Kubernetes, making distributed training more
accessible. However, fine-tuning tasks often require extensive manual intervention, including the
configuration of training environments and the distribution of data across nodes. The Fine-Tune API
aims to simplify this process, offering an easy-to-use Python interface that abstracts away the
complexity involved in setting up and executing fine-tuning tasks on distributed systems.

## The Rationale Behind Kubeflow's Fine-Tune API

Implementing the Fine-Tune API within Training Operator is a logical step in enhancing the platform's
capabilities. By providing this API, Training Operator not only simplifies the user experience for
ML practitioners but also leverages its existing infrastructure for distributed training.
This approach aligns with Kubeflow's mission to democratize distributed ML training, making it more
accessible and less cumbersome for users. The API facilitates a seamless transition from model
development to deployment, supporting the fine-tuning of LLMs on custom datasets without the need
for extensive manual setup or specialized knowledge of Kubernetes internals.

## Roles and Interests

Different user personas can benefit from this feature:

- **MLOps Engineers:** Can leverage this API to automate and streamline the setup and execution of
  fine-tuning tasks, reducing operational overhead.

- **Data Scientists:** Can focus more on model experimentation and less on the logistical aspects of
  distributed training, speeding up the iteration cycle.

- **Business Owners:** Can expect quicker turnaround times for tailored ML solutions, enabling faster
  response to market needs or operational challenges.

- **Platform Engineers:** Can utilize this API to better operationalize the ML toolkit, ensuring
  scalability and efficiency in managing ML workflows.

## Next Steps

- Understand [the architecture behind the `train` API](/docs/components/training/reference/fine-tuning).

content/en/docs/components/training/images/fine-tune-llm-api.drawio.svg (4 changes: 4 additions & 0 deletions; SVG diagram, preview not available)

content/en/docs/components/training/reference/fine-tuning.md (57 changes: 57 additions & 0 deletions)

@@ -0,0 +1,57 @@
+++
title = "LLM Fine-Tuning with Training Operator"
description = "How Training Operator performs fine-tuning on Kubernetes"
weight = 10
+++

This page shows how Training Operator implements the
[API to fine-tune LLMs](/docs/components/training/user-guides/fine-tuning).

## Architecture

In the following diagram you can see how the `train` Python API works:

<img src="/docs/components/training/images/fine-tune-llm-api.drawio.svg"
  alt="Fine-Tune API for LLMs"
  class="mt-3 mb-3">

- Once a user executes the `train` API, Training Operator creates a PyTorchJob with the appropriate
  resources to fine-tune the LLM.

- The storage initializer InitContainer is added to PyTorchJob worker 0 to download the
  pre-trained model and dataset using the provided parameters.

- A PVC with the [`ReadOnlyMany` access mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
  is attached to each PyTorchJob worker to distribute the model and dataset across Pods. **Note**: Your
  Kubernetes cluster must support volumes with the `ReadOnlyMany` access mode, otherwise you can use a
  single PyTorchJob worker.

- Every PyTorchJob worker runs the LLM Trainer, which fine-tunes the model using the provided parameters.

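If you want to confirm this layout on a live cluster, you can inspect the generated PyTorchJob and its
Pods with the Training SDK client. The snippet below is only a sketch: the `get_job` and `get_job_pods`
helper names are assumptions about the `TrainingClient` in the Training Python SDK, and `fine-tune-bert`
is the job name from the user guide example, so verify the exact method names against your installed SDK version.

```python
from kubeflow.training import TrainingClient

client = TrainingClient()

# Assumption: these TrainingClient helpers exist under these names in your SDK version.
job = client.get_job(name="fine-tune-bert")        # the PyTorchJob created by `train`
pods = client.get_job_pods(name="fine-tune-bert")  # its master/worker Pods

for pod in pods:
    # Worker 0 should carry the storage initializer InitContainer, and every worker
    # should mount the shared PVC that holds the downloaded model and dataset.
    init_containers = [c.name for c in (pod.spec.init_containers or [])]
    volumes = [v.name for v in (pod.spec.volumes or [])]
    print(pod.metadata.name, init_containers, volumes)
```
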
Training Operator implements the `train` API with these pre-created components:

### Model Provider

The model provider downloads the pre-trained model. Currently, Training Operator supports the
[HuggingFace model provider](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/hugging_face.py#L56)
that downloads the model from the HuggingFace Hub.

You can implement your own model provider by using [this abstract base class](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/abstract_model_provider.py#L4).

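As a rough, hypothetical sketch of what such a custom provider could look like: the class
`MyRegistryModelProvider`, the parameter dataclass `MyRegistryModelParams`, and the method names
`load_config` and `download_model_and_tokenizer` are all assumptions made for illustration, so align
them with the abstract methods actually declared in `abstract_model_provider.py` for your SDK version.

```python
import json
from dataclasses import dataclass


@dataclass
class MyRegistryModelParams:
    # Hypothetical parameters for a custom model registry; names are illustrative.
    model_uri: str
    access_token: str = ""


class MyRegistryModelProvider:
    """Sketch of a custom model provider; the real interface is defined by the
    abstract base class linked above, so the method names below are assumptions."""

    def load_config(self, serialised_args: str) -> None:
        # Parse the JSON-serialized parameters passed to the storage initializer.
        self.config = MyRegistryModelParams(**json.loads(serialised_args))

    def download_model_and_tokenizer(self) -> None:
        # Download model weights and tokenizer files into the shared volume that
        # the PyTorchJob workers mount (the exact path is SDK-specific).
        ...
```
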
### Dataset Provider

The dataset provider downloads the dataset. Currently, Training Operator supports the
[AWS S3](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/s3.py#L37)
and [HuggingFace](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/hugging_face.py#L92)
dataset providers.

You can implement your own dataset provider by using [this abstract base class](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/abstract_dataset_provider.py).

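For illustration, a dataset stored in S3 could be described with `S3DatasetParams` and passed to `train`
as `dataset_provider_parameters`, similar to the HuggingFace example in the user guide. The field names
below are assumptions about the dataclass in the linked `s3.py`, and the bucket, key, and credentials
are placeholders, so check the actual definition in your SDK version.

```python
from kubeflow.storage_initializer.s3 import S3DatasetParams

# Assumption: S3DatasetParams exposes roughly these fields; verify against s3.py
# in your installed SDK. All values below are placeholders.
s3_dataset = S3DatasetParams(
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-datasets",
    file_key="yelp/train.csv",
    region_name="us-east-1",
    access_key="<AWS_ACCESS_KEY_ID>",
    secret_key="<AWS_SECRET_ACCESS_KEY>",
)

# This object would then be passed to TrainingClient().train(
#     ..., dataset_provider_parameters=s3_dataset, ...).
```
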
### LLM Trainer

The trainer implements the training loop to fine-tune the LLM. Currently, Training Operator supports the
[HuggingFace trainer](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/trainer/hf_llm_training.py#L118-L139)
to fine-tune LLMs.

You can implement your own trainer for other ML use-cases such as image classification,
voice recognition, etc.

content/en/docs/components/training/user-guides/fine-tuning.md (97 changes: 97 additions & 0 deletions)

@@ -0,0 +1,97 @@
+++
title = "How to Fine-Tune LLMs with Kubeflow"
description = "Overview of LLM fine-tuning API in Training Operator"
weight = 10
+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
share your experience using the [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
or [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
{{% /alert %}}

This page describes how to use the [`train` API from the Training Python SDK](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112), which simplifies the ability to fine-tune LLMs with
distributed PyTorchJob workers.

If you want to learn more about how the fine-tuning API fits into the Kubeflow ecosystem, head to the
[explanation guide](/docs/components/training/explanation/fine-tuning).

## Prerequisites

You need to install the Training Python SDK [with fine-tuning support](/docs/components/training/installation/#install-python-sdk-with-fine-tuning-capabilities)
to run this API.

## How to use the Fine-Tuning API

You need to provide the following parameters to use the `train` API:

- Pre-trained model parameters.
- Dataset parameters.
- Trainer parameters.
- Number of PyTorch workers and resources per worker.

For example, you can use the `train` API as follows to fine-tune the BERT model using the Yelp Review
dataset from the HuggingFace Hub:

```python
import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    # BERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use 3000 samples from Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:3000]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
        ),
        # Set LoRA config to reduce number of trainable model parameters.
        lora_config=LoraConfig(
            r=8,
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    num_workers=4,  # nnodes parameter for torchrun command.
    num_procs_per_worker=2,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        "gpu": 2,
        "cpu": 5,
        "memory": "10G",
    },
)
```

After you execute `train`, Training Operator will orchestrate the appropriate PyTorchJob resources
to fine-tune the LLM.

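Once the job is running, you can follow its progress from the same SDK client. This is only a sketch:
the `get_job_logs`, `wait_for_job_conditions`, and `delete_job` helper names and parameters are
assumptions about `TrainingClient`, so confirm the exact signatures in your installed SDK version.

```python
from kubeflow.training import TrainingClient

client = TrainingClient()

# Assumption: these helpers exist on TrainingClient under these names in your SDK version.
client.get_job_logs(name="fine-tune-bert", follow=True)  # stream the LLM Trainer logs
client.wait_for_job_conditions(name="fine-tune-bert")    # block until the PyTorchJob succeeds

# Clean up the PyTorchJob once fine-tuning is complete.
client.delete_job(name="fine-tune-bert")
```
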
## Next Steps

- Run the example to [fine-tune the TinyLlama LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb).

- Check this example to compare the `create_job` and `train` Python APIs for
  [fine-tuning the BERT LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/text-classification/Fine-Tune-BERT-LLM.ipynb).

- Understand [the architecture behind the `train` API](/docs/components/training/reference/fine-tuning).