Training: Add Fine-Tune API Docs #3718
+++
title = "How to Fine-Tune LLM with Kubeflow"
description = "Overview of the LLM fine-tuning API in the Training Operator"
weight = 10
+++

{{% alert title="Warning" color="warning" %}}
This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
share your experience using the [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
or the [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
{{% /alert %}}

In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI),
the ability to fine-tune pre-trained models represents a significant leap towards achieving custom
solutions with less effort and time. Fine-tuning allows practitioners to adapt large language models
(LLMs) like BERT or GPT to their specific needs by training these models on custom datasets.
This process builds on the model's architecture and learned parameters while making it more relevant
to particular applications. Whether you're working in natural language processing (NLP),
image classification, or another ML domain, fine-tuning can drastically improve the performance and
applicability of pre-existing models on new datasets and problems.
## Why the Training Operator Fine-Tune API Matters
The introduction of the Fine-Tune API in the Training Operator Python SDK is a game-changer for ML
practitioners operating within the Kubernetes ecosystem. Historically, the Training Operator has
streamlined the orchestration of ML workloads on Kubernetes, making distributed training more
accessible. However, fine-tuning tasks often require extensive manual intervention, including the
configuration of training environments and the distribution of data across nodes. The Fine-Tune API
aims to simplify this process, offering an easy-to-use Python interface that abstracts away the
complexity involved in setting up and executing fine-tuning tasks on distributed systems.

## How to use the Fine-Tuning API

The [Training Operator Python SDK](/docs/components/training/installation/#installing-training-python-sdk)
implements a [`train` Python API](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112)
that simplifies fine-tuning LLMs with distributed PyTorchJob workers.

You need to provide the following parameters to use the `train` API:

- Pre-trained model parameters.
- Dataset parameters.
- Trainer parameters.
- Number of PyTorch workers and resources per worker.
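
**Note**: to run the example below end to end, you need a Kubernetes cluster with the Training
Operator control plane deployed and a Python environment with the Training Python SDK and its
HuggingFace dependencies installed, for example `pip install -U "kubeflow-training[huggingface]"`.
The exact extras name is an assumption here, so check the
[installation guide](/docs/components/training/installation/) for the supported command; the
`transformers` and `peft` imports used in the example are expected to be pulled in by those
dependencies.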
|
||||||
For example, you can use the `train` API as follows to fine-tune the BERT model using the Yelp Review
dataset from HuggingFace Hub:

```python
import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    # BERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use 3000 samples from Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:3000]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
        ),
        # Set LoRA config to reduce number of trainable model parameters.
        lora_config=LoraConfig(
            r=8,
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    num_workers=4,  # nnodes parameter for torchrun command.
    num_procs_per_worker=2,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        "gpu": 2,
        "cpu": 5,
        "memory": "10G",
    },
)
```
After you execute the `train` API, the Training Operator will orchestrate the appropriate PyTorchJob
resources to fine-tune the LLM.
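
Once the job is created, you may want to follow its progress from the same environment. The snippet
below is a minimal sketch: it assumes the `TrainingClient` helpers `get_job_logs` and
`is_job_succeeded` available in recent SDK versions and reuses the job name from the example above,
so check the SDK reference for the exact signatures in your version.

```python
from kubeflow.training import TrainingClient

client = TrainingClient()

# Stream training logs from the PyTorchJob master (worker 0) while it runs.
# PyTorchJob is the default job kind for TrainingClient, so job_kind is omitted here.
client.get_job_logs(name="fine-tune-bert", follow=True)

# Check whether the PyTorchJob completed successfully.
if client.is_job_succeeded(name="fine-tune-bert"):
    print("Fine-tuning job succeeded.")
```

You can also inspect the underlying resources directly with `kubectl`, since the `train` API creates a
regular PyTorchJob named after the `name` argument.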
## Architecture

In the following diagram, you can see how the `train` API works:

<img src="/docs/components/training/images/fine-tune-llm-api.drawio.svg"
  alt="Fine-Tune API for LLMs"
  class="mt-3 mb-3">
- Once a user executes the `train` API, the Training Operator creates a PyTorchJob with the appropriate
  resources to fine-tune the LLM.

- A storage initializer InitContainer is added to PyTorchJob worker 0 to download the pre-trained
  model and dataset with the provided parameters.

- A PVC with the [`ReadOnlyMany` access mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
  is attached to each PyTorchJob worker to distribute the model and dataset across the Pods.
  **Note**: Your Kubernetes cluster must support volumes with the `ReadOnlyMany` access mode;
  otherwise, you can use a single PyTorchJob worker.

- Every PyTorchJob worker runs the LLM Trainer, which fine-tunes the model using the provided parameters.

The Training Operator implements the `train` API with these pre-created components:
### Model Provider

The model provider downloads the pre-trained model. Currently, the Training Operator supports the
[HuggingFace model provider](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/hugging_face.py#L56)
that downloads models from HuggingFace Hub.

You can implement your own model provider by using [this abstract base class](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/abstract_model_provider.py#L4).
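
For example, a custom provider can subclass that base class and implement its download hooks. The
sketch below is only an outline: the class and method names (`modelProvider`, `load_config`,
`download_model_and_tokenizer`) are based on the abstract base class and the HuggingFace provider at
the commit linked above and may differ in other versions, and `MyRegistryModelParams` is a
hypothetical parameter class, so verify the actual interface against the linked source.

```python
import json
from dataclasses import dataclass

# Assumption: module path and class name follow the abstract base class linked above.
from kubeflow.storage_initializer.abstract_model_provider import modelProvider


@dataclass
class MyRegistryModelParams:
    # Hypothetical parameters for a private model registry.
    model_uri: str
    access_token: str


class MyRegistryModelProvider(modelProvider):
    """Downloads a pre-trained model from a hypothetical private model registry."""

    def load_config(self, serialised_args):
        # The storage initializer passes the parameters as a serialized JSON string
        # (mirroring the HuggingFace provider); verify this against the base class.
        self.config = MyRegistryModelParams(**json.loads(serialised_args))

    def download_model_and_tokenizer(self):
        # Fetch the model weights and tokenizer files into the shared volume
        # so that every PyTorchJob worker can read them.
        ...
```

The dataset providers described below follow the same pattern with their own abstract base class.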
### Dataset Provider

The dataset provider downloads the dataset. Currently, the Training Operator supports the
[AWS S3](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/s3.py#L37)
and [HuggingFace](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/hugging_face.py#L92)
dataset providers.

You can implement your own dataset provider by using [this abstract base class](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/abstract_dataset_provider.py).

### LLM Trainer

The trainer implements the training loop that fine-tunes the LLM. Currently, the Training Operator
supports the [HuggingFace trainer](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/trainer/hf_llm_training.py#L118-L139)
to fine-tune LLMs.

You can implement your own trainer for other ML use cases such as image classification,
voice recognition, etc.
## User Value for this Feature
Different user personas can benefit from this feature:
- **MLOps Engineers:** Can leverage these APIs to automate and streamline the setup and execution of
  fine-tuning tasks, reducing operational overhead.

- **Data Scientists:** Can focus more on model experimentation and less on the logistical aspects of
  distributed training, speeding up the iteration cycle.

- **Business Owners:** Can expect quicker turnaround times for tailored ML solutions, enabling faster
  response to market needs or operational challenges.

- **Platform Engineers:** Can utilize these APIs to better operationalize the ML toolkit, ensuring
  scalability and efficiency in managing ML workflows.
## Next Steps

- Run the example to [fine-tune the TinyLlama LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb).

- Check this example to compare the `create_job` and `train` Python APIs for
  [fine-tuning the BERT LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/text-classification/Fine-Tune-BERT-LLM.ipynb).