Small fixes to Lit-GPT demo (#334)
gkroiz authored Nov 17, 2023
2 parents b6dc51a + ef46259 commit 6c5e3fe
Showing 6 changed files with 68 additions and 52 deletions.
85 changes: 46 additions & 39 deletions sample_workloads/lit-gpt-demo/README.md
## Overview

This document provides instructions for running a sample PyTorch-based workload on A3 using TCPx, including the limitations of general PyTorch integration.


## Pre-Requisites

This guide assumes that you have already created a GKE cluster by following this repo, with the proper GPU drivers and host images for TCPx.


## Limitations


### TCPx Limitations with PyTorch Versions

TCPx currently supports a specific NCCL version, which limits the supported versions of PyTorch. The released TCPx binary officially supports NCCL version `2.18.1`, plus an unreleased version `2.18.5unpack_memsyncapifix` based on [this commit](https://github.com/NVIDIA/nccl/commit/321549b7d5e6039a86c0431d0c85e996f9f5fe12). This NCCL version will be installed on the host VM by the nccl-installer daemonset (v3.1.6_2023_10_06). \

If you are comfortable with using the unofficial nccl version then you can use a
Some testing has also been done on 2.17.1 (image versions [23.04-py3](http://nvcr.io/nvidia/pytorch:23.04-py3), [23.03-py3](http://nvcr.io/nvidia/pytorch:23.03-py3)) and it is functional, but not considered officially supported.


## LitGPT Sample Workload

If you are building LitGPT from source, we recommend running these commands in a shell with plenty of memory available (please read our [troubleshooting section](https://docs.google.com/document/d/14x-Lim29ZdcpudJalBn12sQVy9oqutlHr-QOJnvRwxA/edit?resourcekey=0-xfgzT7fofhl9K5qCwSRdLg#heading=h.1zt9nloo6lvr) for more information). If you are consuming the pre-built LitGPT image, these commands can be run in any shell where you can install Docker.


### Environment Setup



```
export CLUSTER_NAME=<name of GKE cluster>
export REGION=<region>
export PROJECT_ID=<project>
```


3. Install `kubectl` and fetch credentials for your GKE cluster.
```
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID
```
4. Install Helm.
```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
sudo chmod +x /usr/local/bin/helm
```
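Before continuing, it can help to confirm that the CLIs used throughout this guide are on your `PATH`. This is a convenience check, not part of the original setup steps:

```shell
# Report which of the required command-line tools are installed.
for tool in gcloud kubectl helm docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```

Any tool reported as `MISSING` should be installed before moving on.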

### Set up Lit-GPT


### Use Pre-built Docker Image

A pre-built example for quickly running LitGPT is available as a sample workload in the [ai-infra-cluster-provisioning](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/tree/develop/sample_workloads/lit-gpt-demo) repo. See [Run LitGPT](#run-lit-gpt) for the next set of instructions.

Several additional parameters are available in the helm values.yaml file when us



### Build Custom Docker Image

If you would rather modify and set up LitGPT on your own, for example if you want to add custom model configs or additional hyperparameter tuning, follow these steps to build the image from source.


#### Docker Image Setup


##### Setup Artifact Registry

Follow [https://cloud.google.com/artifact-registry/docs/repositories/create-repos](https://cloud.google.com/artifact-registry/docs/repositories/create-repos) and make sure to create the repository for Docker images.

Set `$ARTIFACT_REGISTRY` to the registry URL returned.
export ARTIFACT_REGISTRY=<artifact_registry>
```

**Note:** `ARTIFACT_REGISTRY` is generally in the format of `{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REGISTRY_NAME}`.
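As a quick illustration of that format, the registry URL can be assembled from its components. The values below are placeholders, not real project settings:

```shell
# Assemble the Artifact Registry URL from LOCATION, PROJECT_ID, and REGISTRY_NAME.
LOCATION=us-central1
PROJECT_ID=my-project
REGISTRY_NAME=my-registry
export ARTIFACT_REGISTRY="${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REGISTRY_NAME}"
echo "$ARTIFACT_REGISTRY"   # us-central1-docker.pkg.dev/my-project/my-registry
```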


### Setup Docker

We need to install Docker since we plan to create our own Docker images. Please refer to [https://docs.docker.com/engine/install/](https://docs.docker.com/engine/install/) for an installation guide. Once Docker is installed, we need to set it up with gcloud. Please run the following (or follow [https://cloud.google.com/artifact-registry/docs/docker/authentication](https://cloud.google.com/artifact-registry/docs/docker/authentication)):

```
gcloud auth configure-docker $LOCATION-docker.pkg.dev
```


### Setup Docker files and Scripts

Please clone the [ai-infra-cluster-provisioning](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning) repo.

```
sudo -E bash build_and_push_litgpt.sh
```
Once the command completes, a new **image tag** will be output in the console. Please keep a record of it; it will be used in the [Helm Config File Setup](#helm-config-file-setup).


### Setting up data

This Lit-GPT training example uses the openwebtext dataset, which can be prepared by following [https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/pretrain_openwebtext.md](https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/pretrain_openwebtext.md). Please upload the resulting files to a Google Cloud Storage (GCS) bucket ([https://cloud.google.com/storage/docs/creating-buckets](https://cloud.google.com/storage/docs/creating-buckets)).

Alternatively, you can find pre-copied versions of this data at:
**Note:** If you use the `litgpt-public-bucket` to load the dataset then you will not be able to upload your training run data to a GCS bucket. If you want GCS logs for your training run then copy those blobs to a bucket that you have write permissions to.


### Setup for distributed training

In the definition of the `Trainer` object ([https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L123-L135](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L123-L135)), we need to add another argument: `num_nodes`. This should match the `nNodes` value in `helm/values.yaml`.

**Note:** Both of the requested code changes above are already present in [https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/blob/develop/sample_workloads/lit-gpt-demo/openwebtext_trainer.py](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/blob/litgptparams/sample_workloads/lit-gpt-demo/openwebtext_trainer.py).


### Additional Changes to Lit-GPT code


#### Add New Model Configurations

To change the model configuration, please edit [https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24). You can set this `model_name` to any name in [https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/config.py](https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/config.py). We also recommend adding your own configurations.

transformers = [



#### Hyperparameter changes

Please take a look at [https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24-L46](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24-L46).

If you want to customize hyperparameters or parts of the model, please either (1) make the adjustments in the lit-gpt source code or (2) add some flags to adjust in the command line. Look at `litgpt_container_entrypoint.sh` for where exactly the training script is being called.
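For intuition on how `batchSize` and `microBatchSize` interact, here is a sketch assuming lit-gpt's usual gradient-accumulation scheme (the global batch is split into micro-batches; the ratio below is our reading of the trainer, not stated in this README):

```shell
# With gradient accumulation, each optimizer step processes BATCH_SIZE samples
# in BATCH_SIZE / MICRO_BATCH_SIZE forward/backward passes.
BATCH_SIZE=6
MICRO_BATCH_SIZE=2
echo "accumulation steps per optimizer step: $(( BATCH_SIZE / MICRO_BATCH_SIZE ))"
# prints: accumulation steps per optimizer step: 3
```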


### Run Lit-GPT

**Note:** If there are any changes to `lit-gpt`, please build and push a new Docker image.


### Helm Config File Setup

Next, create a copy of `helm/values.yaml.example` without the `.example` ending.

```
cp helm/values.yaml.example helm/values.yaml
```

```
network:
rxdmContainer: "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd:v2.0.7"
disablePmtu: "yes"
workload:
jobTimestamp: <int: add a timestamp here or unique identifier>
gcsExperimentBucket: <str: your gcs bucket where experiment logs should go>
experimentDir: <str: root gcs directory of experiment>
gcsDataBucket: <str: your gcs bucket where data is located>
  dataDir: <str: directory within the gcs bucket where train.bin and val.bin are located>
image: us-central1-docker.pkg.dev/<YOUR PROJECT ID>/<ARTIFACT REGISTRY NAME>/litgpt-full:<ADD TAG HERE>
configDirInBucket: null
batchSize: 6
microBatchSize: 6
modelName: Llama-2-70b-hf
```

In the helm config file `values.yaml`, you need to make changes to the following flags based on the workload requirements:

* `nodePool`
* `nNodes`
* `jobTimestamp`
* `gcsExperimentBucket`
* `experimentDir`
* `gcsDataBucket`
* `dataDir`
* `image`

`nodePool` refers to the name of the GKE NodePool where the LitGPT job will be run.

`nNodes` refers to the number of GKE GPU nodes in the GKE NodePool (specified by `nodePool`) for running the LitGPT job. Note that the value of `nNodes` cannot exceed the total number of GKE GPU nodes in that NodePool.

`jobTimestamp` needs to be a timestamp or a unique identifier.
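One simple way to generate such an identifier (a suggestion, not mandated by the demo) is the current Unix time:

```shell
# Use the current Unix epoch seconds as a unique-enough job identifier.
JOB_TIMESTAMP=$(date +%s)
echo "$JOB_TIMESTAMP"
```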

`gcsExperimentBucket` refers to the GCS bucket where experiment logs and output are stored.

`experimentDir` refers to a directory in the GCS bucket (specified by `gcsExperimentBucket`) for logging. For example, `pir-pythia-6.9b/training_logs/` is a directory already set up for logging in the shared GCS bucket `litgpt-public-bucket`. \
Alternatively, you can create your own directory (e.g. named `logDir`) under your own GCS bucket (e.g. named `myBucket`). Then the logs will be saved at the target location (e.g. `gs://myBucket`/`logDir`) designated by the parameter `experimentDir`.

`gcsDataBucket` refers to the GCS bucket where your data is stored. For example, the shared bucket `litgpt-public-bucket` contains pre-copied versions of the openwebtext dataset. Alternatively, you can use your own GCS bucket (e.g. named `myBucket`).

`dataDir` refers to a directory in the GCS bucket (specified by `gcsDataBucket`) for training data. For example, `openwebtext_dataset` is a directory containing the training data in the shared GCS bucket `litgpt-public-bucket`. \
Alternatively, you can create your own directory (e.g. named `dataSourceDir`) under your own GCS bucket (e.g. named `myBucket`). Then the training data will be loaded from the source location (e.g. `gs://myBucket/dataSourceDir/train.bin` and `gs://myBucket/dataSourceDir/val.bin`) designated by the parameter `dataDir`.
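Putting those two values together, the training data is expected at paths of the following shape (`myBucket` and `dataSourceDir` are the hypothetical names from the text above):

```shell
# Compose the GCS object paths the workload will read train/val data from.
GCS_DATA_BUCKET=myBucket
DATA_DIR=dataSourceDir
echo "gs://${GCS_DATA_BUCKET}/${DATA_DIR}/train.bin"
echo "gs://${GCS_DATA_BUCKET}/${DATA_DIR}/val.bin"
```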

`image `refers to the Docker image set up for LitGPT. The value of this flag is in the following format: `<repository_location>-docker.pkg.dev/<project>/<repository_name>/litgpt-full:<tag>`
You can check the status of your workload via any of the following:
**Note:** pod0 contains logs that other pods in the same experiment do not contain.


### MFU Calculation

MFU can be calculated from the `metrics.csv` file output by LitGPT. During a training run this file can be found in the litgpt container at `/workspace/out/openwebtext/version_0/metrics.csv`. After training completes, this file will be uploaded as part of the `experimentDir` specified in the helm values.

Step times are presented in the csv as aggregate times in the column `time/train`, so the value used should be
`time/train[n] - time/train[n-1]`.
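As a sketch of that calculation (using a tiny made-up `time/train` column rather than a real `metrics.csv`):

```shell
# Recover per-step times from the aggregate time/train column:
# step_time[n] = time/train[n] - time/train[n-1]
printf 'time/train\n10.0\n21.5\n33.5\n' |
awk 'NR > 2 { printf "%.1f\n", $1 - prev } NR > 1 { prev = $1 }'
# prints 11.5 then 12.0
```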


MFU for this sample workload can be calculated using the formula:

```
estimated_flops = ops_per_step * trainable_flops * batch_size
```

For example, running Llama2-70b on 40 VMs would have you calculate this as:
The MFU value is also available in the `metrics.csv` file after 50 iterations at column `throughput/device/mfu`, though we have seen inconsistent numbers reported and recommend calculating it manually.


## Troubleshooting

**Docker build issues**

3 changes: 3 additions & 0 deletions sample_workloads/lit-gpt-demo/build_and_push_litgpt.sh
```
FULL_IMAGE=${FULL_IMAGE:="$ARTIFACT_REGISTRY/litgpt-full"}

# Clone LitGPT and check out a flash-attn enabled commit
if [ ! -d $LITGPT_PATH ]; then
  git clone https://github.com/Lightning-AI/lit-gpt.git
  cd lit-gpt
  git checkout d5d371417ecb3d3b6c4f30837d8bb7cf2b5310ae
  cd ..
  LITGPT_PATH=lit-gpt
fi
```
9 changes: 6 additions & 3 deletions sample_workloads/lit-gpt-demo/helm/templates/litgpt.yaml
```yaml
{{- $requiredVar := .Values.cluster.nNodes | required ".Values.cluster.nNodes is required" -}}
{{- $requiredVar := .Values.cluster.nodePool | required ".Values.cluster.nodePool is required" -}}
{{- $requiredVar := .Values.network.ncclIfnames | required ".Values.ncclIfnames is required" -}}
{{- $requiredVar := .Values.workload.jobTimestamp | required ".Values.jobTimestamp is required" -}}
{{- $requiredVar := .Values.workload.gcsExperimentBucket | required ".Values.gcsExperimentBucket is required" -}}
{{- $requiredVar := .Values.workload.experimentDir | required ".Values.experimentDir is required" -}}
{{- $requiredVar := .Values.workload.gcsDataBucket | required ".Values.gcsDataBucket is required" -}}
{{- $requiredVar := .Values.workload.dataDir | required ".Values.dataDir is required" -}}
{{- $requiredVar := .Values.workload.image | required ".Values.image is required" -}}
apiVersion: v1
```

```yaml
# env entries on the workload container:
- name: DISABLE_PMTU
  value: "{{$root.Values.network.disablePmtu}}"
- name: CPU_PINNING_MODE
  value: "{{$root.Values.network.cpuPinningMode}}"
- name: GCS_EXPERIMENT_BUCKET
  value: "{{$root.Values.workload.gcsExperimentBucket}}"
- name: EXPERIMENT_ROOT_DIR
  value: "{{$root.Values.workload.experimentDir}}"
- name: GCS_DATA_BUCKET
  value: "{{$root.Values.workload.gcsDataBucket}}"
- name: DATA_DIR
  value: "{{$root.Values.workload.dataDir}}"
- name: BATCH_SIZE
```
3 changes: 2 additions & 1 deletion sample_workloads/lit-gpt-demo/helm/values.yaml.example
```yaml
network:

workload:
  jobTimestamp: <int: add a timestamp here or unique identifier>
  gcsExperimentBucket: <str: your gcs bucket where experiment logs should go>
  experimentDir: <str: root gcs directory of experiment>
  gcsDataBucket: <str: your gcs bucket where data is located>
  dataDir: <str: directory within the gcs bucket where train.bin and val.bin are located>
  image: us-central1-docker.pkg.dev/<YOUR PROJECT ID>/<ARTIFACT REGISTRY NAME>/litgpt-full:<ADD TAG HERE>
  configDirInBucket: null
```
4 changes: 2 additions & 2 deletions sample_workloads/lit-gpt-demo/openwebtext_trainer.py
```python
model_name = os.getenv("MODEL_NAME", "Llama-2-70b-hf")
name = "openwebtext"
out_dir = Path(os.getenv("EXPERIMENT_LOCAL_DIR", "")) / "out"
data_dir = Path("/data")
save_interval = 1000
eval_interval = 1000
```

```python
# inside main(devices: int = 1, precision: Optional[str] = None, tpu: bool = False):
    else:
        strategy = "auto"

    logger = step_csv_logger(out_dir, name, cls=CSVLogger, flush_logs_every_n_steps=log_interval)
    speed_monitor = SpeedMonitorCallback(
        length_fn=lambda batch: batch[0].size(1), batch_size=micro_batch_size, window_size=50, time_unit="seconds"
    )
```
sample_workloads/lit-gpt-demo/litgpt_container_entrypoint.sh

```
set -o pipefail

: "${MASTER_ADDR:?Must set MASTER_ADDR}"
: "${NODE_RANK:?Must set NODE_RANK}"
: "${JOB_TIMESTAMP:?Must set JOB_TIMESTAMP}"
: "${NNODES:?Must set NNODES}"
: "${GCS_EXPERIMENT_BUCKET:?Must set GCS_EXPERIMENT_BUCKET}"
: "${EXPERIMENT_ROOT_DIR:?Must set EXPERIMENT_ROOT_DIR}"
: "${GCS_DATA_BUCKET:?Must set GCS_DATA_BUCKET}"
: "${DATA_DIR:?Must set DATA_DIR}"

export EXPERIMENT_LOCAL_DIR=/experiment/${EXPERIMENT_ROOT_DIR}

mkdir -p $EXPERIMENT_LOCAL_DIR

echo $EXPERIMENT_ROOT_DIR
echo $EXPERIMENT_LOCAL_DIR

gsutil rsync -r gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/ ${EXPERIMENT_LOCAL_DIR}/

LOCAL_DATA_DIR=/data
mkdir -p $LOCAL_DATA_DIR
gsutil -m rsync gs://${GCS_DATA_BUCKET}/${DATA_DIR} /data

export MASTER_PORT=6002
export GPUS_PER_NODE=8
```

```
function on_script_completion {
  # semaphore to cleanly exit hardware utilization monitor
  touch /tmp/workload_terminated

  echo "Uploading ${EXPERIMENT_LOCAL_DIR} to gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/"
  gsutil rsync -r ${EXPERIMENT_LOCAL_DIR}/ gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/
}
```
