✨ Go through the Mistral example and improve it a bit #8

Merged: 39 commits, Nov 28, 2024
Changes from all commits
14adb7e
Make README images work with non-GitHub Markdown renderers
ruksi Nov 14, 2024
2ac18c4
Add alt texts to the README images
ruksi Nov 14, 2024
2841a34
Use title case on README headings
ruksi Nov 14, 2024
dc81a72
Be a bit more clear in web app vs. command-line client
ruksi Nov 14, 2024
811e51b
Don't instruct to create two projects if using the terminal
ruksi Nov 14, 2024
bc7d414
Only instruct to install `valohai-cli` as the bare minimum
ruksi Nov 14, 2024
5117c95
Remove extra whitespace on the inference explanation
ruksi Nov 14, 2024
40ffd0b
Make blank lines a bit more consistent
ruksi Nov 14, 2024
cbb300c
Consistent plurals in headers
ruksi Nov 14, 2024
a59378c
Improve the section introductions a bit
ruksi Nov 14, 2024
1d62f7d
Highlight the final remark a bit more
ruksi Nov 14, 2024
00c8f81
Use inputs directory path helper in data preprocessing
ruksi Nov 14, 2024
22fee2a
Remove unnecessary default value in save dataset
ruksi Nov 14, 2024
c5af4a7
Use `dir_path` to get input directory name
ruksi Nov 14, 2024
f74ecfb
Fix deprecation warnings on training params
ruksi Nov 14, 2024
2c79ebf
Clean the variable names a bit
ruksi Nov 14, 2024
bf2ea53
Fix output path so it works locally too
ruksi Nov 14, 2024
098bacb
Remove unused argument `model_path` from inference
ruksi Nov 14, 2024
2eeeedc
Use `valohai.inputs` to get the checkpoint dir if not specified
ruksi Nov 14, 2024
0fb35cb
Make the inference flow a bit more logical
ruksi Nov 14, 2024
72b4ddd
Make all the mains feel similar
ruksi Nov 14, 2024
22756dc
Capitalize Hugging Face
ruksi Nov 14, 2024
7ffeaf5
Remove extra blank lines
ruksi Nov 14, 2024
abc3046
Lock dependency versions
ruksi Nov 15, 2024
65d687c
Add Dockerfile
ruksi Nov 15, 2024
a4502a2
Upgrade Docker images
ruksi Nov 15, 2024
7ba88d2
Make prompt format more consistent
ruksi Nov 15, 2024
4549da5
Make YAML prompt param a string literal
ruksi Nov 15, 2024
38d325a
Rename "Steps" section to "Overview"
ruksi Nov 15, 2024
8d6c0d8
Close the detail boxes by default
ruksi Nov 15, 2024
624d19c
Use emojis to highlight the actionable sections
ruksi Nov 15, 2024
119f3d2
Add guidance how to configure the Hugging Face API access
ruksi Nov 15, 2024
6ba2a72
Fix README lint errors, mainly indentation
ruksi Nov 15, 2024
28114fc
Add a proper preface to the setup section
ruksi Nov 15, 2024
ce2878e
Prefer reading prompt from Valohai `parameters.json`
ruksi Nov 19, 2024
ddca6b4
Mark the requirements and Docker image as GPU
ruksi Nov 19, 2024
c0e1d63
DRY args and prompting to get more consistent results
ruksi Nov 19, 2024
de71abb
Make all screenshot URLs relative
ruksi Nov 27, 2024
8ee63bf
Move all screenshots under .github
ruksi Nov 27, 2024
Binary file added .github/screenshots/hf_access_token_page.png
Binary file added .github/screenshots/hf_agree_to_terms.png
Binary file added .github/screenshots/hf_create_token.png
Binary file added .github/screenshots/hf_get_token.png
Binary file added .github/screenshots/vh_project_env_vars.png
1 change: 0 additions & 1 deletion .gitignore
@@ -2,4 +2,3 @@
.DS_Store
.idea
.valohai
Dockerfile
40 changes: 40 additions & 0 deletions DEVELOPMENT.md
@@ -0,0 +1,40 @@
# GPU Environments

## Dependencies

Resolve and lock dependencies on GPU environments:

```bash
uv pip compile requirements.in -o requirements-gpu.txt
```

## Docker Image

Build the GPU enabled Docker image:

```bash
docker build -f Dockerfile.gpu -t llm-toolkit:dev-gpu .
```

Smoke test the Docker image:

```bash
docker run -it --rm -v $(pwd):/workspace llm-toolkit:dev-gpu /bin/bash
python -c "import torch; print(torch.__version__)"
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
```

Release a new version of the GPU enabled Docker image:

```bash
export LLM_TOOLKIT_VERSION=0.2-gpu
docker tag llm-toolkit:dev-gpu valohai/llm-toolkit:$LLM_TOOLKIT_VERSION
docker push valohai/llm-toolkit:$LLM_TOOLKIT_VERSION
```

Cleanup:

```bash
docker rmi valohai/llm-toolkit:$LLM_TOOLKIT_VERSION
docker rmi llm-toolkit:dev-gpu
```
14 changes: 14 additions & 0 deletions Dockerfile.gpu
@@ -0,0 +1,14 @@
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

ENV PYTHONUNBUFFERED=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_ROOT_USER_ACTION=ignore

WORKDIR /workspace

COPY requirements-gpu.txt .

RUN pip install \
--no-cache-dir \
-r requirements-gpu.txt \
&& rm requirements-gpu.txt
142 changes: 92 additions & 50 deletions README.md
@@ -1,48 +1,53 @@
# Mistral fine-tuning with Valohai
# Mistral Fine-Tuning with Valohai

This project serves as an on-ramp to [Valohai][vh] and is designed to be the first step for individuals starting with their self-serve trial.
The primary goal of this template is to showcase the power of Valohai for fine-tuning large language models, with a special focus on the Mistral 7B model.


[vh]: https://valohai.com/
[app]: https://app.valohai.com
## <div align="center">Steps</div>
[hf_login]: https://huggingface.co/login
[hf_mistral]: https://huggingface.co/mistralai/Mistral-7B-v0.1

## <div align="center">Overview</div>

### **Data Preprocessing**:

* **Loading Data**:
In our project, data is seamlessly fetched from our S3 bucket.
When you initiate an execution, the data is automatically stored in the `/valohai/inputs/` directory on the machine. Additionally, the tokenizer is sourced directly from the Hugging Face repository and it is also available in `/valohai/inputs/` directory.
When you initiate an execution, the data is automatically stored in the `/valohai/inputs/` directory on the machine. Additionally, the tokenizer is sourced directly from the Hugging Face repository, and it is also available in `/valohai/inputs/` directory.

* **Tokenization**: To make the data suitable for language models, it's tokenized using the tokenizer from Hugging Face's Mistral repository. Tokenization basically means breaking down the text into smaller units, like words or subwords, so that the model can work with it.

* **Saving Processed Data**: After tokenization, the processed data is saved in a way that makes it easy to use later. This processed data is saved to Valohai datasets with a special alias, making it convenient for further steps in the machine learning process.

This streamlined workflow empowers you to focus on your machine learning tasks, while Valohai handles data management, versioning, and efficient storage.
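
As a rough illustration of what tokenization means here, the following sketch tokenizes one sample data point; the sample values and the 512-token limit are illustrative assumptions, not necessarily the project's exact configuration.

```python
# A minimal sketch of the tokenization step described above. The sample
# data point and the 512-token limit are illustrative assumptions.
from transformers import AutoTokenizer

# Requires access to the gated Mistral repository (see the Setup section).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral defines no pad token by default

data_point = {
    "mr": "name[The Eagle], eatType[coffee shop]",
    "ref": "The Eagle is a coffee shop.",
}

# Break the reference sentence into token ids the model can work with.
tokens = tokenizer(data_point["ref"], truncation=True, max_length=512)
print(tokens["input_ids"])
```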

### **Model fine-tuning**:
### **Model Fine-Tuning**:

* **Loading Data and Model**: The code loads the prepared training data from Valohai datasets. It also fetches the base model from an S3 bucket. This base model is a pre-trained Mistral model.

* **Model Enhancement**: The base model is enhanced to make it better for training with a method called "PEFT." This enhancement involves configuring the model for better training performance.

* **Training the Model**: The script then trains the model using the prepared data using the Trainer from transformers library. It fine-tunes the model, making it better at understanding video gaming text.
* **Training the Model**: The script then trains the model using the prepared data using Trainer from the Transformers library. It fine-tunes the model, making it better at understanding video gaming text.

* **Saving Results**: After training, the script saves checkpoints of the model's progress. These checkpoints are stored in Valohai datasets for easy access in the next steps, like inference.
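
To make the PEFT enhancement more concrete, here is a minimal sketch of how a LoRA adapter is typically attached to a base model before training with the Transformers Trainer. The hyperparameters, output path, and dataset variables are illustrative assumptions, not this project's exact configuration.

```python
# Illustrative LoRA/PEFT fine-tuning sketch; hyperparameters, paths,
# and dataset variables are assumptions, not this project's exact setup.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Wrap the base model so only small low-rank adapter weights are trained.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="/valohai/outputs/checkpoints",  # checkpoints become Valohai outputs
        max_steps=100,
    ),
    train_dataset=tokenized_train_dataset,  # produced by the preprocessing step
    eval_dataset=tokenized_val_dataset,
)
trainer.train()
```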

### **Model inference**:
### **Model Inference**:

In the inference step, we use the fine-tuned language model to generate text based on a given prompt. Here's a simplified explanation of what happens in this code:

* **Loading Model and Checkpoints**: The code loads the base model from an S3 bucket and the fine-tuned checkpoint from the previous step, which is stored in Valohai datasets.

* **Inference** : Using the fine-tuned model and provided test prompt, we obtain a model-generated response, which is decoded by tokenizer to make it human-readable.
* **Inference**: Using the fine-tuned model and the provided test prompt, we get a model-generated response, which the tokenizer decodes to make it human-readable.
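
A minimal sketch of this flow, assuming a LoRA-style adapter checkpoint; the paths and the prompt are placeholders, not this project's exact values.

```python
# Illustrative inference sketch; paths and the prompt are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Load the fine-tuned adapter weights on top of the base model.
model = PeftModel.from_pretrained(base_model, "/valohai/inputs/checkpoint")
model.eval()

prompt = "Given a meaning representation generate a target sentence ..."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)

# Decode the token ids back into human-readable text.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```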

## <div align="center">Setup</div>

## <div align="center">Installation</div>
Before we can run any code, we need to set up the project. This section explains how to do that using either the Valohai web app or the terminal.

Login to the [Valohai app][app] and create a new project.
<details>
<summary>🌐 Using the web app</summary>

### Configure the repository:
<details open>
<summary>Using UI</summary>
Log in to [the Valohai web app][app] and create a new project.

Configure this repository as the project's repository by following these steps:

@@ -53,58 +58,93 @@
5. Click on the Save button to save the changes.
</details>

<details open>
<summary>Using terminal</summary>
<details>
<summary>⌨️ Using the terminal</summary>

To run your code on Valohai using the terminal, follow these steps:

1. Install the Valohai command-line client by running the following command:
```bash
pip install valohai-cli valohai-utils
```

```bash
pip install valohai-cli
```

2. Log in to Valohai from the terminal using the command:
```bash
vh login
```

```bash
vh login
```

3. Create a project for your Valohai workflow.
Start by creating a directory for your project:
```bash
mkdir valohai-mistral-example
cd valohai-mistral-example
```

Then, create the Valohai project:
```bash
vh project create
```
```bash
mkdir valohai-mistral-example
cd valohai-mistral-example
```

Then, create the Valohai project:
```bash
vh project create
```

4. Clone the repository to your local machine:
```bash
git clone https://github.com/valohai/mistral-example.git .
```

```bash
git clone https://github.com/valohai/mistral-example.git .
```

</details>

Now you are ready to run executions and pipelines.
<details>
<summary>🌐 / ⌨️ Setup for both</summary>

Authorize the Valohai project to download models and tokenizers from Hugging Face.

1. Log in to [the Hugging Face platform][hf_login]

2. Agree to [the terms of the Mistral model][hf_mistral]; the license is Apache 2.

![Agree to the terms set by Mistral to use their models](.github/screenshots/hf_agree_to_terms.png)

3. Create an access token under Hugging Face settings.

![Access token controls under Hugging Face settings](.github/screenshots/hf_access_token_page.png)

![Access token creation form under Hugging Face settings](.github/screenshots/hf_create_token.png)

_You can either choose to allow access to all public models you've agreed to or only the Mistral model._

Copy the token and store it in a secure place; you won't be seeing it again.

![Copy the token for later use](.github/screenshots/hf_get_token.png)

4. Add the `hf_xxx` token to your Valohai project as a secret named `HF_TOKEN`.

![Valohai project environment variable configuration page](.github/screenshots/vh_project_env_vars.png)

Now all workloads on this project have scoped access to Hugging Face unless you specifically restrict them.
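
With the secret in place, scripts can read it from the environment when downloading gated models. A minimal sketch is below; note that recent `huggingface_hub` versions may also pick up `HF_TOKEN` automatically.

```python
import os

from transformers import AutoTokenizer

# The HF_TOKEN secret is exposed to the workload as an environment variable.
hf_token = os.environ.get("HF_TOKEN")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", token=hf_token)
```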

</details>

## <div align="center">Running Executions</div>
This repository covers essential tasks such as data preprocessing, model fine-tuning and inference using Mistral model.
<details open>
<summary>Using UI</summary>

This repository defines the essential tasks, or "steps", such as data preprocessing, model fine-tuning, and inference of Mistral models. You can execute these tasks individually or as part of a pipeline. This section covers how to run them individually.

<details>
<summary>🌐 Using the web app</summary>

1. Go to the Executions tab in your project.
2. Create a new execution by selecting one of the predefined steps: _data-preprocess_, _finetune_, or _inference_.
3. Customize the execution parameters if needed.
4. Start the execution to run the selected step.

![alt text](https://github.com/valohai/mistral-example/blob/main/screenshots/create_execution.jpeg)
![Create execution page on Valohai UI](.github/screenshots/create_execution.jpeg)

</details>

<details open>
<summary>Using terminal</summary>
<details>
<summary>⌨️ Using the terminal</summary>

@@ -118,23 +158,25 @@
To run individual steps, execute the following command:

```bash
vh execution run data-preprocess --adhoc
```
</details>

## <div align="center">Running Pipeline</div>
## <div align="center">Running Pipelines</div>

<details open>
<summary>Using UI</summary>
When you have a collection of tasks that you want to run together, you create a pipeline. This section explains how to run the predefined pipelines in this repository.

<details>
<summary>🌐 Using the web app</summary>

1. Navigate to the Pipelines tab.
2. Create a new pipeline and select the blueprint _training-pipeline_.
3. Create pipeline from template.
3. Create a pipeline from template.
4. Configure the pipeline settings.
5. Create pipeline.

![alt text](https://github.com/valohai/mistral-example/blob/main/screenshots/create_pipeline.jpeg)
![Choosing of pipeline blueprint on Valohai UI](.github/screenshots/create_pipeline.jpeg)

</details>

<details open>
<summary>Using terminal</summary>
<details>
<summary>⌨️ Using the terminal</summary>

To run pipelines, use the following command:

@@ -150,11 +192,11 @@
```bash
vh pipeline run training-pipeline --adhoc
```

The completed pipeline view:

![alt text](https://github.com/valohai/mistral-example/blob/main/screenshots/completed_pipeline.jpeg)

![Graph of the completed pipeline on Valohai UI](.github/screenshots/completed_pipeline.jpeg)

The response generated by the model looks like this:

![alt text](https://github.com/valohai/mistral-example/blob/main/screenshots/inference_result.jpeg)
![Showcasing the LLM responses inside a Valohai execution](.github/screenshots/inference_result.jpeg)

We need to consider that the model underwent only a limited number of fine-tuning steps, so achieving satisfactory results might necessitate further experimentation with model parameters.
> [!IMPORTANT]
> The example configuration undergoes only a limited number of fine-tuning steps. Achieving satisfactory results might require further experimentation with the model parameters.
41 changes: 13 additions & 28 deletions data-preprocess.py
@@ -5,16 +5,15 @@

import valohai
from datasets import load_dataset
from transformers import AutoTokenizer

from helpers import get_run_identification
from helpers import get_run_identification, get_tokenizer, promptify


class DataPreprocessor:
def __init__(self, args):
self.data_path = args.data_path or os.path.dirname(valohai.inputs('dataset').path())
self.model_max_length = args.model_max_length
self.tokenizer = args.tokenizer
self.data_path = args.data_path or valohai.inputs('dataset').dir_path()
self.model_id = args.model_id
self.max_tokens = args.max_tokens
dataset = load_dataset(
'csv',
data_files={
@@ -33,30 +32,18 @@ def prepare_datasets(self, generate_and_tokenize_prompt):
return tknzd_train_dataset, tknzd_val_dataset

def generate_and_tokenize_prompt(self, data_point, tokenizer):
full_prompt = f"""Given a meaning representation generate a target sentence that utilizes the attributes and attribute values given. The sentence should use all the information provided in the meaning representation.
### Target sentence:
{data_point["ref"]}

### Meaning representation:
{data_point["mr"]}
"""
return tokenizer(full_prompt, truncation=True, max_length=self.model_max_length, padding='max_length')
prompt = promptify(sentence=data_point['ref'], meaning=data_point['mr'])
return tokenizer(prompt, truncation=True, max_length=self.max_tokens)

def load_and_prepare_data(self):
tokenizer = AutoTokenizer.from_pretrained(
self.tokenizer,
model_max_length=self.model_max_length,
padding_side='left',
add_eos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer = get_tokenizer(self.model_id, self.max_tokens)
tokenized_train_dataset, tokenized_val_dataset = self.prepare_datasets(
lambda data_point: self.generate_and_tokenize_prompt(data_point, tokenizer),
)
return tokenized_train_dataset, tokenized_val_dataset, self.test_dataset

@staticmethod
def save_dataset(dataset, tag='train'):
def save_dataset(dataset, tag):
project_name, exec_id = get_run_identification()

metadata = {
@@ -79,19 +66,17 @@ def save_dataset(dataset, tag='train'):

def main():
logging.basicConfig(level=logging.INFO)
parser = argparse.ArgumentParser(description='Prepare data')

# Add arguments based on your script's needs
parser = argparse.ArgumentParser(description='Prepare data')
# fmt: off
parser.add_argument('--data_path', type=str, default=None)
parser.add_argument('--tokenizer', type=str, default='mistralai/Mistral-7B-v0.1', help='Huggingface tokenizer link')
parser.add_argument('--model_max_length', type=int, default=512, help='Maximum length for the model')

parser.add_argument('--model_id', type=str, default="mistralai/Mistral-7B-v0.1", help="Model identifier from Hugging Face, also defines the tokenizer")
parser.add_argument('--max_tokens', type=int, default=512, help="The maximum number of tokens that the model can process in a single forward pass")
# fmt: on
args = parser.parse_args()

data_preprocessor = DataPreprocessor(args)

tokenized_train_dataset, tokenized_val_dataset, test_dataset = data_preprocessor.load_and_prepare_data()

data_preprocessor.save_dataset(tokenized_train_dataset, 'train')
data_preprocessor.save_dataset(tokenized_val_dataset, 'val')
data_preprocessor.save_dataset(test_dataset, 'test')
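
The refactored code imports `promptify` and `get_tokenizer` from a `helpers` module that this diff does not show. Judging from the inline code removed above, those helpers plausibly look something like the following reconstruction; the actual `helpers.py` may differ.

```python
# Plausible reconstruction of the new helpers, based on the inline code
# removed in this diff; the actual helpers.py may differ.
from transformers import AutoTokenizer


def promptify(sentence: str, meaning: str) -> str:
    # Rebuild the prompt format that used to live inline in data-preprocess.py.
    return f"""Given a meaning representation generate a target sentence that utilizes the attributes and attribute values given. The sentence should use all the information provided in the meaning representation.
### Target sentence:
{sentence}

### Meaning representation:
{meaning}
"""


def get_tokenizer(model_id: str, max_tokens: int):
    # Mirror the tokenizer setup removed from load_and_prepare_data().
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        model_max_length=max_tokens,
        padding_side='left',
        add_eos_token=True,
    )
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```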