✨ Go through the Mistral example and improve it a bit #8

Merged: 39 commits, Nov 28, 2024
Changes from all commits
14adb7e
Make README images work with non-GitHub Markdown renderers
ruksi Nov 14, 2024
2ac18c4
Add alt texts to the README images
ruksi Nov 14, 2024
2841a34
Use title case on README headings
ruksi Nov 14, 2024
dc81a72
Be a bit more clear in web app vs. command-line client
ruksi Nov 14, 2024
811e51b
Don't instruct to create two projects if using the terminal
ruksi Nov 14, 2024
bc7d414
Only instruct to install `valohai-cli` as the bare minimum
ruksi Nov 14, 2024
5117c95
Remove extra whitespace on the inference explanation
ruksi Nov 14, 2024
40ffd0b
Make blank lines a bit more consistent
ruksi Nov 14, 2024
cbb300c
Consistent plurals in headers
ruksi Nov 14, 2024
a59378c
Improve the section introductions a bit
ruksi Nov 14, 2024
1d62f7d
Highlight the final remark a bit more
ruksi Nov 14, 2024
00c8f81
Use inputs directory path helper in data preprocessing
ruksi Nov 14, 2024
22fee2a
Remove unnecessary default value in save dataset
ruksi Nov 14, 2024
c5af4a7
Use `dir_path` to get input directory name
ruksi Nov 14, 2024
f74ecfb
Fix deprecation warnings on training params
ruksi Nov 14, 2024
2c79ebf
Clean the variable names a bit
ruksi Nov 14, 2024
bf2ea53
Fix output path so it works locally too
ruksi Nov 14, 2024
098bacb
Remove unused argument `model_path` from inference
ruksi Nov 14, 2024
2eeeedc
Use `valohai.inputs` to get the checkpoint dir if not specified
ruksi Nov 14, 2024
0fb35cb
Make the inference flow a bit more logical
ruksi Nov 14, 2024
72b4ddd
Make all the mains feel similar
ruksi Nov 14, 2024
22756dc
Capitalize Hugging Face
ruksi Nov 14, 2024
7ffeaf5
Remove extra blank lines
ruksi Nov 14, 2024
abc3046
Lock dependency versions
ruksi Nov 15, 2024
65d687c
Add Dockerfile
ruksi Nov 15, 2024
a4502a2
Upgrade Docker images
ruksi Nov 15, 2024
7ba88d2
Make prompt format more consistent
ruksi Nov 15, 2024
4549da5
Make YAML prompt param a string literal
ruksi Nov 15, 2024
38d325a
Rename "Steps" section to "Overview"
ruksi Nov 15, 2024
8d6c0d8
Close the detail boxes by default
ruksi Nov 15, 2024
624d19c
Use emojis to highlight the actionable sections
ruksi Nov 15, 2024
119f3d2
Add guidance how to configure the Hugging Face API access
ruksi Nov 15, 2024
6ba2a72
Fix README lint errors, mainly indentation
ruksi Nov 15, 2024
28114fc
Add a proper preface to the setup section
ruksi Nov 15, 2024
ce2878e
Prefer reading prompt from Valohai `parameters.json`
ruksi Nov 19, 2024
ddca6b4
Mark the requirements and Docker image as GPU
ruksi Nov 19, 2024
c0e1d63
DRY args and prompting to get more consistent results
ruksi Nov 19, 2024
de71abb
Make all screenshot URLs relative
ruksi Nov 27, 2024
8ee63bf
Move all screenshots under .github
ruksi Nov 27, 2024
Binary file added .github/screenshots/hf_access_token_page.png
Binary file added .github/screenshots/hf_agree_to_terms.png
Binary file added .github/screenshots/hf_create_token.png
Binary file added .github/screenshots/hf_get_token.png
Binary file added .github/screenshots/vh_project_env_vars.png
1 change: 0 additions & 1 deletion .gitignore
@@ -2,4 +2,3 @@
.DS_Store
.idea
.valohai
Dockerfile
40 changes: 40 additions & 0 deletions DEVELOPMENT.md
@@ -0,0 +1,40 @@
# GPU Environments

## Dependencies

Resolve and lock dependencies on GPU environments:

```bash
uv pip compile requirements.in -o requirements-gpu.txt
```

## Docker Image

Build the GPU enabled Docker image:

```bash
docker build -f Dockerfile.gpu -t llm-toolkit:dev-gpu .
```

Smoke test the Docker image:

```bash
docker run -it --rm -v $(pwd):/workspace llm-toolkit:dev-gpu /bin/bash
python -c "import torch; print(torch.__version__)"
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"
```

Release a new version of the GPU enabled Docker image:

```bash
export LLM_TOOLKIT_VERSION=0.2-gpu
docker tag llm-toolkit:dev-gpu valohai/llm-toolkit:$LLM_TOOLKIT_VERSION
docker push valohai/llm-toolkit:$LLM_TOOLKIT_VERSION
```

Cleanup:

```bash
docker rmi valohai/llm-toolkit:$LLM_TOOLKIT_VERSION
docker rmi llm-toolkit:dev-gpu
```
14 changes: 14 additions & 0 deletions Dockerfile.gpu
@@ -0,0 +1,14 @@
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

ENV PYTHONUNBUFFERED=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_ROOT_USER_ACTION=ignore

WORKDIR /workspace

COPY requirements-gpu.txt .

RUN pip install \
--no-cache-dir \
-r requirements-gpu.txt \
&& rm requirements-gpu.txt
142 changes: 92 additions & 50 deletions README.md
@@ -1,48 +1,53 @@
# Mistral fine-tuning with Valohai
# Mistral Fine-Tuning with Valohai

This project serves as an on-ramp to [Valohai][vh] and is designed to be the first step for individuals starting with their self-serve trial.
The primary goal of this template is to showcase the power of Valohai for fine-tuning large language models, with a special focus on the Mistral 7B model.


[vh]: https://valohai.com/
[app]: https://app.valohai.com
## <div align="center">Steps</div>
[hf_login]: https://huggingface.co/login
[hf_mistral]: https://huggingface.co/mistralai/Mistral-7B-v0.1

## <div align="center">Overview</div>

### **Data Preprocessing**:

* **Loading Data**:
In our project, data is seamlessly fetched from our S3 bucket.
When you initiate an execution, the data is automatically stored in the `/valohai/inputs/` directory on the machine. Additionally, the tokenizer is sourced directly from the Hugging Face repository and it is also available in `/valohai/inputs/` directory.
When you initiate an execution, the data is automatically stored in the `/valohai/inputs/` directory on the machine. Additionally, the tokenizer is sourced directly from the Hugging Face repository, and it is also available in `/valohai/inputs/` directory.

* **Tokenization**: To make the data suitable for language models, it's tokenized using the tokenizer from Hugging Face's Mistral repository. Tokenization basically means breaking down the text into smaller units, like words or subwords, so that the model can work with it.

* **Saving Processed Data**: After tokenization, the processed data is saved in a way that makes it easy to use later. This processed data is saved to Valohai datasets with a special alias, making it convenient for further steps in the machine learning process.

This streamlined workflow empowers you to focus on your machine learning tasks, while Valohai handles data management, versioning, and efficient storage.
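
As a rough illustration of what tokenization means here, the following sketch tokenizes one sample data point; the sample values and the 512-token limit are illustrative assumptions, not necessarily the project's exact configuration.

```python
# A minimal sketch of the tokenization step described above. The sample
# data point and the 512-token limit are illustrative assumptions.
from transformers import AutoTokenizer

# Requires access to the gated Mistral repository (see the Setup section).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral defines no pad token by default

data_point = {
    "mr": "name[The Eagle], eatType[coffee shop]",
    "ref": "The Eagle is a coffee shop.",
}

# Break the reference sentence into token ids the model can work with.
tokens = tokenizer(data_point["ref"], truncation=True, max_length=512)
print(tokens["input_ids"])
```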

### **Model fine-tuning**:
### **Model Fine-Tuning**:

* **Loading Data and Model**: The code loads the prepared training data from Valohai datasets. It also fetches the base model from an S3 bucket. This base model is a pre-trained Mistral model.

* **Model Enhancement**: The base model is enhanced to make it better for training with a method called "PEFT." This enhancement involves configuring the model for better training performance.

* **Training the Model**: The script then trains the model using the prepared data using the Trainer from transformers library. It fine-tunes the model, making it better at understanding video gaming text.
* **Training the Model**: The script then trains the model using the prepared data using Trainer from the Transformers library. It fine-tunes the model, making it better at understanding video gaming text.

* **Saving Results**: After training, the script saves checkpoints of the model's progress. These checkpoints are stored in Valohai datasets for easy access in the next steps, like inference.
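
To make the PEFT enhancement more concrete, here is a minimal sketch of how a LoRA adapter is typically attached to a base model before training with the Transformers Trainer. The hyperparameters, output path, and dataset variables are illustrative assumptions, not this project's exact configuration.

```python
# Illustrative LoRA/PEFT fine-tuning sketch; hyperparameters, paths,
# and dataset variables are assumptions, not this project's exact setup.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Wrap the base model so only small low-rank adapter weights are trained.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="/valohai/outputs/checkpoints",  # checkpoints become Valohai outputs
        max_steps=100,
    ),
    train_dataset=tokenized_train_dataset,  # produced by the preprocessing step
    eval_dataset=tokenized_val_dataset,
)
trainer.train()
```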

### **Model inference**:
### **Model Inference**:

In the inference step, we use the fine-tuned language model to generate text based on a given prompt. Here's a simplified explanation of what happens in this code:

* **Loading Model and Checkpoints**: The code loads the base model from an S3 bucket and the fine-tuned checkpoint from the previous step, which is stored in Valohai datasets.

* **Inference** : Using the fine-tuned model and provided test prompt, we obtain a model-generated response, which is decoded by tokenizer to make it human-readable.
* **Inference**: Using the fine-tuned model and the provided test prompt, we get a model-generated response, which the tokenizer decodes to make it human-readable.
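
A minimal sketch of this flow, assuming a LoRA-style adapter checkpoint; the paths and the prompt are placeholders, not this project's exact values.

```python
# Illustrative inference sketch; paths and the prompt are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Load the fine-tuned adapter weights on top of the base model.
model = PeftModel.from_pretrained(base_model, "/valohai/inputs/checkpoint")
model.eval()

prompt = "Given a meaning representation generate a target sentence ..."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)

# Decode the token ids back into human-readable text.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```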

## <div align="center">Setup</div>

## <div align="center">Installation</div>
Before we can run any code, we need to set up the project. This section explains how to do that using either the Valohai web app or the terminal.

Login to the [Valohai app][app] and create a new project.
<details>
<summary>🌐 Using the web app</summary>

### Configure the repository:
<details open>
<summary>Using UI</summary>
Log in to [the Valohai web app][app] and create a new project.

Configure this repository as the project's repository by following these steps:

@@ -53,58 +58,93 @@
5. Click on the Save button to save the changes.
</details>

<details open>
<summary>Using terminal</summary>
<details>
<summary>⌨️ Using the terminal</summary>

To run your code on Valohai using the terminal, follow these steps:

1. Install the Valohai command-line client by running the following command:
```bash
pip install valohai-cli valohai-utils
```

```bash
pip install valohai-cli
```

2. Log in to Valohai from the terminal using the command:
```bash
vh login
```

```bash
vh login
```

3. Create a project for your Valohai workflow.
Start by creating a directory for your project:
```bash
mkdir valohai-mistral-example
cd valohai-mistral-example
```

Then, create the Valohai project:
```bash
vh project create
```
```bash
mkdir valohai-mistral-example
cd valohai-mistral-example
```

Then, create the Valohai project:
```bash
vh project create
```

4. Clone the repository to your local machine:
```bash
git clone https://github.com/valohai/mistral-example.git .
```

```bash
git clone https://github.com/valohai/mistral-example.git .
```

</details>

Now you are ready to run executions and pipelines.
<details>
<summary>🌐 / ⌨️ Setup for both</summary>

Authorize the Valohai project to download models and tokenizers from Hugging Face.

1. Log in to [the Hugging Face platform][hf_login]

2. Agree to [the terms of the Mistral model][hf_mistral]; the license is Apache 2.

![Agree to the terms set by Mistral to use their models](.github/screenshots/hf_agree_to_terms.png)

3. Create an access token under Hugging Face settings.

![Access token controls under Hugging Face settings](.github/screenshots/hf_access_token_page.png)

![Access token creation form under Hugging Face settings](.github/screenshots/hf_create_token.png)

_You can either choose to allow access to all public models you've agreed to or only the Mistral model._

Copy the token and store it in a secure place; you won't be seeing it again.

![Copy the token for later use](.github/screenshots/hf_get_token.png)

4. Add the `hf_xxx` token to your Valohai project as a secret named `HF_TOKEN`.

![Valohai project environment variable configuration page](.github/screenshots/vh_project_env_vars.png)

Now all workloads on this project have scoped access to Hugging Face unless you specifically restrict them.
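
With the secret in place, scripts can read it from the environment when downloading gated models. A minimal sketch is below; note that recent `huggingface_hub` versions may also pick up `HF_TOKEN` automatically.

```python
import os

from transformers import AutoTokenizer

# The HF_TOKEN secret is exposed to the workload as an environment variable.
hf_token = os.environ.get("HF_TOKEN")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", token=hf_token)
```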

</details>

## <div align="center">Running Executions</div>
This repository covers essential tasks such as data preprocessing, model fine-tuning and inference using Mistral model.
<details open>
<summary>Using UI</summary>

This repository defines the essential tasks, or "steps", such as data preprocessing, model fine-tuning, and inference of Mistral models. You can execute these tasks individually or as part of a pipeline. This section covers how to run them individually.

<details>
<summary>🌐 Using the web app</summary>

1. Go to the Executions tab in your project.
2. Create a new execution by selecting one of the predefined steps: _data-preprocess_, _finetune_, or _inference_.
3. Customize the execution parameters if needed.
4. Start the execution to run the selected step.

![alt text](https://github.com/valohai/mistral-example/blob/main/screenshots/create_execution.jpeg)
![Create execution page on Valohai UI](.github/screenshots/create_execution.jpeg)

</details>

<details open>
<summary>Using terminal</summary>
<details>
<summary>⌨️ Using the terminal</summary>

@@ -118,23 +158,25 @@
To run individual steps, execute the following command:

```bash
vh execution run data-preprocess --adhoc
```
</details>

## <div align="center">Running Pipeline</div>
## <div align="center">Running Pipelines</div>

<details open>
<summary>Using UI</summary>
When you have a collection of tasks that you want to run together, you create a pipeline. This section explains how to run the predefined pipelines in this repository.

<details>
<summary>🌐 Using the web app</summary>

1. Navigate to the Pipelines tab.
2. Create a new pipeline and select the blueprint _training-pipeline_.
3. Create pipeline from template.
3. Create a pipeline from template.
4. Configure the pipeline settings.
5. Create pipeline.

![alt text](https://github.com/valohai/mistral-example/blob/main/screenshots/create_pipeline.jpeg)
![Choosing of pipeline blueprint on Valohai UI](.github/screenshots/create_pipeline.jpeg)

</details>

<details open>
<summary>Using terminal</summary>
<details>
<summary>⌨️ Using the terminal</summary>

To run pipelines, use the following command:

@@ -150,11 +192,11 @@
```bash
vh pipeline run training-pipeline --adhoc
```

The completed pipeline view:

![alt text](https://github.com/valohai/mistral-example/blob/main/screenshots/completed_pipeline.jpeg)

![Graph of the completed pipeline on Valohai UI](.github/screenshots/completed_pipeline.jpeg)

The response generated by the model looks like this:

![alt text](https://github.com/valohai/mistral-example/blob/main/screenshots/inference_result.jpeg)
![Showcasing the LLM responses inside a Valohai execution](.github/screenshots/inference_result.jpeg)

We need to consider that the model underwent only a limited number of fine-tuning steps, so achieving satisfactory results might necessitate further experimentation with model parameters.
> [!IMPORTANT]
> The example configuration undergoes only a limited number of fine-tuning steps. Achieving satisfactory results might require further experimentation with the model parameters.
41 changes: 13 additions & 28 deletions data-preprocess.py
@@ -5,16 +5,15 @@

import valohai
from datasets import load_dataset
from transformers import AutoTokenizer

from helpers import get_run_identification
from helpers import get_run_identification, get_tokenizer, promptify


class DataPreprocessor:
def __init__(self, args):
self.data_path = args.data_path or os.path.dirname(valohai.inputs('dataset').path())
self.model_max_length = args.model_max_length
self.tokenizer = args.tokenizer
self.data_path = args.data_path or valohai.inputs('dataset').dir_path()
self.model_id = args.model_id
self.max_tokens = args.max_tokens
dataset = load_dataset(
'csv',
data_files={
@@ -33,30 +32,18 @@ def prepare_datasets(self, generate_and_tokenize_prompt):
return tknzd_train_dataset, tknzd_val_dataset

def generate_and_tokenize_prompt(self, data_point, tokenizer):
full_prompt = f"""Given a meaning representation generate a target sentence that utilizes the attributes and attribute values given. The sentence should use all the information provided in the meaning representation.
### Target sentence:
{data_point["ref"]}

### Meaning representation:
{data_point["mr"]}
"""
return tokenizer(full_prompt, truncation=True, max_length=self.model_max_length, padding='max_length')
prompt = promptify(sentence=data_point['ref'], meaning=data_point['mr'])
return tokenizer(prompt, truncation=True, max_length=self.max_tokens)

def load_and_prepare_data(self):
tokenizer = AutoTokenizer.from_pretrained(
self.tokenizer,
model_max_length=self.model_max_length,
padding_side='left',
add_eos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer = get_tokenizer(self.model_id, self.max_tokens)
tokenized_train_dataset, tokenized_val_dataset = self.prepare_datasets(
lambda data_point: self.generate_and_tokenize_prompt(data_point, tokenizer),
)
return tokenized_train_dataset, tokenized_val_dataset, self.test_dataset

@staticmethod
def save_dataset(dataset, tag='train'):
def save_dataset(dataset, tag):
project_name, exec_id = get_run_identification()

metadata = {
@@ -79,19 +66,17 @@ def save_dataset(dataset, tag='train'):

def main():
logging.basicConfig(level=logging.INFO)
parser = argparse.ArgumentParser(description='Prepare data')

# Add arguments based on your script's needs
parser = argparse.ArgumentParser(description='Prepare data')
# fmt: off
parser.add_argument('--data_path', type=str, default=None)
parser.add_argument('--tokenizer', type=str, default='mistralai/Mistral-7B-v0.1', help='Huggingface tokenizer link')
parser.add_argument('--model_max_length', type=int, default=512, help='Maximum length for the model')

parser.add_argument('--model_id', type=str, default="mistralai/Mistral-7B-v0.1", help="Model identifier from Hugging Face, also defines the tokenizer")
parser.add_argument('--max_tokens', type=int, default=512, help="The maximum number of tokens that the model can process in a single forward pass")
# fmt: on
args = parser.parse_args()

data_preprocessor = DataPreprocessor(args)

tokenized_train_dataset, tokenized_val_dataset, test_dataset = data_preprocessor.load_and_prepare_data()

data_preprocessor.save_dataset(tokenized_train_dataset, 'train')
data_preprocessor.save_dataset(tokenized_val_dataset, 'val')
data_preprocessor.save_dataset(test_dataset, 'test')
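
The refactored code imports `promptify` and `get_tokenizer` from a `helpers` module that this diff does not show. Judging from the inline code removed above, those helpers plausibly look something like the following reconstruction; the actual `helpers.py` may differ.

```python
# Plausible reconstruction of the new helpers, based on the inline code
# removed in this diff; the actual helpers.py may differ.
from transformers import AutoTokenizer


def promptify(sentence: str, meaning: str) -> str:
    # Rebuild the prompt format that used to live inline in data-preprocess.py.
    return f"""Given a meaning representation generate a target sentence that utilizes the attributes and attribute values given. The sentence should use all the information provided in the meaning representation.
### Target sentence:
{sentence}

### Meaning representation:
{meaning}
"""


def get_tokenizer(model_id: str, max_tokens: int):
    # Mirror the tokenizer setup removed from load_and_prepare_data().
    tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        model_max_length=max_tokens,
        padding_side='left',
        add_eos_token=True,
    )
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```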