diff --git a/sample_workloads/lit-gpt-demo/README.md b/sample_workloads/lit-gpt-demo/README.md
index e5e324b1..cf98c70d 100644
--- a/sample_workloads/lit-gpt-demo/README.md
+++ b/sample_workloads/lit-gpt-demo/README.md
@@ -1,17 +1,17 @@
-## Overview 
+## Overview

This document provides instructions on running a sample PyTorch-based workload on A3 using TCPx, including the limitations of general PyTorch integration.

-## Pre-Requisites 
+## Pre-Requisites

This guide assumes that you have already created a GKE cluster according to this repo, with the proper GPU drivers and host images for TCPx.

-## Limitations 
+## Limitations

-### TCPx Limitations with Pytorch versions 
+### TCPx Limitations with PyTorch versions

TCPx currently supports a specific NCCL version, which limits the supported versions of PyTorch. The released TCPx binary officially supports NCCL version `2.18.1`, and an unreleased version `2.18.5unpack_memsyncapifix` based on [this commit](https://github.com/NVIDIA/nccl/commit/321549b7d5e6039a86c0431d0c85e996f9f5fe12). This NCCL version will be installed on the host VM by the nccl-installer daemonset (v3.1.6_2023_10_06). \
@@ -23,12 +23,12 @@ If you are comfortable with using the unofficial nccl version then you can use a

Some testing has also been done on 2.17.1 (image versions [23.04-py3](http://nvcr.io/nvidia/pytorch:23.04-py3), [23.03-py3](http://nvcr.io/nvidia/pytorch:23.03-py3)) and it is functional, but not considered officially supported.

-## LitGPT Sample Workload 
+## LitGPT Sample Workload

If you are building LitGPT from source, we recommend running these commands in a shell with plenty of memory available (please read our [troubleshooting section](https://docs.google.com/document/d/14x-Lim29ZdcpudJalBn12sQVy9oqutlHr-QOJnvRwxA/edit?resourcekey=0-xfgzT7fofhl9K5qCwSRdLg#heading=h.1zt9nloo6lvr) for more information). If you are consuming the pre-built LitGPT image, then these commands can be run in any shell where you can install Docker.

-### Environment Setup 
+### Environment Setup

@@ -42,7 +42,7 @@ export CLUSTER_NAME=
export REGION=
export PROJECT_ID=
```
- 
+
3. Install `kubectl` and fetch credentials for your GKE cluster.

```
@@ -52,16 +52,16 @@ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --proje
```
```
4. Install Helm.
```
-curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 
-chmod 700 get_helm.sh 
+curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
+chmod 700 get_helm.sh
./get_helm.sh
sudo chmod +x /usr/local/bin/helm
```

-### Set up Lit-GPT 
+### Set up Lit-GPT

-### Use Pre-built Docker Image 
+### Use Pre-built Docker Image

A pre-built example for quickly running LitGPT is available as a sample workload in the [ai-infra-cluster-provisioning](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/tree/develop/sample_workloads/lit-gpt-demo) repo. See [Run LitGPT](#run-lit-gpt) for the next set of instructions.

@@ -107,15 +107,15 @@ Several additional parameters are available in the helm values.yaml file when us

-### Build Custom Docker Image 
+### Build Custom Docker Image

If you would rather modify and set up LitGPT on your own, for example to add custom model configs or additional hyperparameter tuning, follow these steps to build the image from source.
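Before building, it can help to confirm that the tools and variables from the Environment Setup section are in place. A minimal sanity check, assuming the `CLUSTER_NAME`, `REGION`, and `PROJECT_ID` variables exported earlier:

```
# Confirm the CLI tools used in the following steps are installed.
for tool in gcloud kubectl helm docker; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done

# Confirm the cluster credentials fetched earlier actually work.
echo "Cluster: $CLUSTER_NAME ($REGION, project $PROJECT_ID)"
kubectl get nodes --no-headers | head -n 3
```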
-#### Docker Image Setup 
+#### Docker Image Setup

-##### Setup Artifact Registry 
+##### Setup Artifact Registry

Follow [https://cloud.google.com/artifact-registry/docs/repositories/create-repos](https://cloud.google.com/artifact-registry/docs/repositories/create-repos) and make sure to create the repository for Docker images.

@@ -133,10 +133,10 @@ Set` $ARTIFACT_REGISTRY` to the Registry URL returned.

export ARTIFACT_REGISTRY=
```

-**Note:** `ARTIFACT_REGISTRY `is generally in the format of `{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REGISTRY_NAME}`. 
+**Note:** `ARTIFACT_REGISTRY` is generally in the format of `{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REGISTRY_NAME}`.

-### Setup Docker 
+### Setup Docker

We need to install Docker since we plan to create our own Docker images. Please refer to [https://docs.docker.com/engine/install/](https://docs.docker.com/engine/install/) for an installation guide. Once Docker is installed, set it up with gcloud by running the following (or follow [https://cloud.google.com/artifact-registry/docs/docker/authentication](https://cloud.google.com/artifact-registry/docs/docker/authentication)): \
```
gcloud auth configure-docker $LOCATION-docker.pkg.dev
```

-### Setup Docker files and Scripts 
+### Setup Docker files and Scripts

Please clone the [ai-infra-cluster-provisioning](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning) repo.

@@ -170,7 +170,7 @@ sudo -E bash build_and_push_litgpt.sh

Once the command is done, a new **image tag** will be printed to the console. Please keep a record of it; it will be used in the [Helm Config File Setup](#helm-config-file-setup).

-### Setting up data 
+### Setting up data

This Lit-GPT training example uses the openwebtext dataset, which can be prepared by following [https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/pretrain_openwebtext.md](https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/pretrain_openwebtext.md). Please upload the resulting files to a Google Cloud Storage (GCS) bucket ([https://cloud.google.com/storage/docs/creating-buckets](https://cloud.google.com/storage/docs/creating-buckets)).

@@ -182,17 +182,17 @@ Alternatively, you can find pre-copied versions of this data at:

**Note:** If you use the `litgpt-public-bucket` to load the dataset, then you will not be able to upload your training run data to a GCS bucket. If you want GCS logs for your training run, copy those blobs to a bucket that you have write permissions on.

-### Setup for distributed training 
+### Setup for distributed training

In the definition of the `Trainer` object ([https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L123-L135](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L123-L135)), we need to add another argument: `num_nodes`. This should match the `nNodes` value in `helm/values.yaml`.
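If you are making this change in your own lit-gpt checkout, a quick way to confirm it took effect (a simple check run from the repo root, assuming the file layout linked above):

```
# Print the Trainer arguments around num_nodes; an empty result means
# the distributed-training change still needs to be made.
grep -n -B2 -A2 "num_nodes" pretrain/openwebtext_trainer.py \
  || echo "num_nodes not found in pretrain/openwebtext_trainer.py"
```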
**Note:** Both of the requested code changes above are already present in [https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/blob/develop/sample_workloads/lit-gpt-demo/openwebtext_trainer.py](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/blob/develop/sample_workloads/lit-gpt-demo/openwebtext_trainer.py)

-### Additional Changes to Lit-GPT code 
+### Additional Changes to Lit-GPT code

-#### Add New Model Configurations 
+#### Add New Model Configurations

To change the model configuration, edit [https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24). You can set this `model_name` to any name defined in [https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/config.py](https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/config.py). We also recommend adding your own configurations.

@@ -207,21 +207,21 @@ transformers = [

-#### Hyperparameter changes 
+#### Hyperparameter changes

Please take a look at [https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24-L46](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24-L46). If you want to customize hyperparameters or parts of the model, either (1) make the adjustments in the lit-gpt source code or (2) add flags to adjust on the command line. Look at `litgpt_container_entrypoint.sh` for exactly where the training script is called.

-### Run Lit-GPT 
+### Run Lit-GPT

**Note:** If there are any changes to `lit-gpt`, please build and push a new Docker image.

-### Helm Config File Setup 
+### Helm Config File Setup

-Next, create a copy of `helm/values.yaml.example` without the `.example` ending. 
+Next, create a copy of `helm/values.yaml.example` without the `.example` ending.

```
cp helm/values.yaml.example helm/values.yaml
```

@@ -240,12 +240,16 @@ network:
   rxdmContainer: "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd:v2.0.7"
   disablePmtu: "yes"
 workload:
-  gcsBucket: litgpt-public-bucket
-  jobTimestamp: 1
-  experimentDir: pir-pythia-6.9b/training_logs/
-  dataDir: openwebtext_dataset
-  image: us-docker.pkg.dev/gce-ai-infra/litgpt-full/litgpt
+  jobTimestamp:
+  gcsExperimentBucket:
+  experimentDir:
+  gcsDataBucket:
+  dataDir:
+  image: us-central1-docker.pkg.dev///litgpt-full:
   configDirInBucket: null
+  batchSize: 6
+  microBatchSize: 6
+  modelName: Llama-2-70b-hf
```

In the helm config file `values.yaml`, you need to make changes to the following eight flags based on the workload requirements:

@@ -254,9 +258,10 @@ In the helm config file `values.yaml`, you need to make changes to the following

* `nodePool`
* `nNodes`
-* `gcsBucket`
* `jobTimestamp`
+* `gcsExperimentBucket`
* `experimentDir`
+* `gcsDataBucket`
* `dataDir`
* `image`

@@ -264,14 +269,16 @@ In the helm config file `values.yaml`, you need to make changes to the following

`nNodes` refers to the number of GKE GPU nodes in the GKE NodePool (specified by `nodePool`) for running the LitGPT job. Note that the value of `nNodes` cannot exceed the total number of GKE GPU nodes in that NodePool.

-`gcsBucket `refers to a GCS bucket. In the example above, `litgpt-public-bucket` contains the pre-copied versions of openwebtext dataset. Alternatively, you can use your own GCS bucket (e.g. named `myBucket`).
-
`jobTimestamp` needs to be a timestamp or another unique identifier.
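One simple way to generate such a value (a sketch; any unique string you can paste into `helm/values.yaml` works):

```
# Generate a unique identifier for jobTimestamp from the current UTC time.
date -u +%Y%m%d-%H%M%S
```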
-`experimentDir `refers to a directory in the GCS bucket (specified by `gcsBucket`) for logging. In the example above,` pir-pythia-6.9b/training_logs/` is a directory already set up for logging in the shared GCS bucket `litgpt-public-bucket`. \
+`gcsExperimentBucket` refers to the GCS bucket in which to store experiment logs and output.
+
+`experimentDir` refers to a directory in the GCS bucket (specified by `gcsExperimentBucket`) for logging. For example, `pir-pythia-6.9b/training_logs/` is a directory already set up for logging in the shared GCS bucket `litgpt-public-bucket`. \
 Alternatively, you can create your own directory (e.g. named `logDir`) under your own GCS bucket (e.g. named `myBucket`). Then the logs will be saved at the target location (e.g. `gs://myBucket/logDir`) designated by the parameter `experimentDir`.

-`dataDir `refers to a directory in the GCS bucket (specified by `gcsBucket`) for training data. In the example above, `openwebtext_dataset` is a directory containing the training data in the shared GCS bucket `litgpt-public-bucket`. \
+`gcsDataBucket` refers to the GCS bucket where your training data is stored. For example, the shared bucket `litgpt-public-bucket` contains pre-copied versions of the openwebtext dataset. Alternatively, you can use your own GCS bucket (e.g. named `myBucket`).
+
+`dataDir` refers to a directory in the GCS bucket (specified by `gcsDataBucket`) for training data. For example, `openwebtext_dataset` is a directory containing the training data in the shared GCS bucket `litgpt-public-bucket`. \
 Alternatively, you can create your own directory (e.g. named `dataSourceDir`) under your own GCS bucket (e.g. named `myBucket`). Then the training data will be loaded from the source location (e.g. `gs://myBucket/dataSourceDir/train.bin` and `gs://myBucket/dataSourceDir/val.bin`) designated by the parameter `dataDir`.

`image` refers to the Docker image set up for LitGPT. The value of this flag is in the following format: `{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REGISTRY_NAME}/litgpt-full:{IMAGE_TAG}`

@@ -287,15 +294,15 @@ You can check the status of your workload via any of the following:

**Note:** pod0 contains logs that the other pods in the same experiment do not.

-### MFU Calculation 
+### MFU Calculation

MFU can be calculated by consuming the `metrics.csv` file output by LitGPT. During a training run, this file can be found in the litgpt container at `/workspace/out/openwebtext/version_0/metrics.csv`. After training is completed, this file will be uploaded as part of the `experimentDir` specified in the helm values.

Step times are presented in the CSV as aggregate times in the column `time/train`, so the value used should be

-`time/train[n] - time/train[n-1]` .
+`time/train[n] - time/train[n-1]`.

-MFU for this sample workload can be calculated using the formula: 
+MFU for this sample workload can be calculated using the formula:

```
estimated_flops = ops_per_step * trainable_flops * batch_size

@@ -312,7 +319,7 @@ For example, running Llama2-70b on 40 VMs would have you calculate this as:

The MFU value is also available in the `metrics.csv` file after 50 iterations, in the column `throughput/device/mfu`, though we have seen inconsistent numbers reported and recommend calculating it manually.
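For the step-time part of that calculation, the following shell sketch may help. It assumes the `metrics.csv` path given above and that the `time/train` values are plain, unquoted numbers:

```
# Print per-step times as deltas of the aggregate time/train column.
awk -F, '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "time/train") col = i; next }
  col && $col != "" { if (prev != "") printf "step %d: %.3f s\n", ++n, $col - prev; prev = $col }
' /workspace/out/openwebtext/version_0/metrics.csv
```

Averaging these deltas gives the step time to plug into the formula above.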
-## Troubleshooting +## Troubleshooting **Docker build issues** diff --git a/sample_workloads/lit-gpt-demo/build_and_push_litgpt.sh b/sample_workloads/lit-gpt-demo/build_and_push_litgpt.sh index 5d5c02a2..de2742f6 100644 --- a/sample_workloads/lit-gpt-demo/build_and_push_litgpt.sh +++ b/sample_workloads/lit-gpt-demo/build_and_push_litgpt.sh @@ -13,6 +13,9 @@ FULL_IMAGE=${FULL_IMAGE:="$ARTIFACT_REGISTRY/litgpt-full"} # Clone LitGPT and checkout a flash-attn enabled commit if [ ! -d $LITGPT_PATH ]; then git clone https://github.com/Lightning-AI/lit-gpt.git + cd lit-gpt + git checkout d5d371417ecb3d3b6c4f30837d8bb7cf2b5310ae + cd .. LITGPT_PATH=lit-gpt fi diff --git a/sample_workloads/lit-gpt-demo/helm/templates/litgpt.yaml b/sample_workloads/lit-gpt-demo/helm/templates/litgpt.yaml index 6ae47694..4216a68e 100644 --- a/sample_workloads/lit-gpt-demo/helm/templates/litgpt.yaml +++ b/sample_workloads/lit-gpt-demo/helm/templates/litgpt.yaml @@ -1,9 +1,10 @@ {{- $requiredVar := .Values.cluster.nNodes | required ".Values.cluster.nNodes is required" -}} {{- $requiredVar := .Values.cluster.nodePool | required ".Values.cluster.nodePool is required" -}} {{- $requiredVar := .Values.network.ncclIfnames | required ".Values.ncclIfnames is required" -}} -{{- $requiredVar := .Values.workload.gcsBucket | required ".Values.gcsBucket is required" -}} {{- $requiredVar := .Values.workload.jobTimestamp | required ".Values.jobTimestamp is required" -}} +{{- $requiredVar := .Values.workload.gcsExperimentBucket | required ".Values.gcsExperimentBucket is required" -}} {{- $requiredVar := .Values.workload.experimentDir | required ".Values.experimentDir is required" -}} +{{- $requiredVar := .Values.workload.gcsDataBucket | required ".Values.gcsDataBucket is required" -}} {{- $requiredVar := .Values.workload.dataDir| required ".Values.dataDir is required" -}} {{- $requiredVar := .Values.workload.image | required ".Values.image is required" -}} apiVersion: v1 @@ -145,10 +146,12 @@ spec: value: "{{$root.Values.network.disablePmtu}}" - name: CPU_PINNING_MODE value: "{{$root.Values.network.cpuPinningMode}}" - - name: GCS_BUCKET - value: "{{$root.Values.workload.gcsBucket}}" + - name: GCS_EXPERIMENT_BUCKET + value: "{{$root.Values.workload.gcsExperimentBucket}}" - name: EXPERIMENT_ROOT_DIR value: "{{$root.Values.workload.experimentDir}}" + - name: GCS_DATA_BUCKET + value: "{{$root.Values.workload.gcsDataBucket}}" - name: DATA_DIR value: "{{$root.Values.workload.dataDir}}" - name: BATCH_SIZE diff --git a/sample_workloads/lit-gpt-demo/helm/values.yaml.example b/sample_workloads/lit-gpt-demo/helm/values.yaml.example index afa87fba..dd966ed0 100644 --- a/sample_workloads/lit-gpt-demo/helm/values.yaml.example +++ b/sample_workloads/lit-gpt-demo/helm/values.yaml.example @@ -10,9 +10,10 @@ network: workload: - gcsBucket: jobTimestamp: + gcsExperimentBucket: experimentDir: + gcsDataBucket: dataDir: image: us-central1-docker.pkg.dev///litgpt-full: configDirInBucket: null diff --git a/sample_workloads/lit-gpt-demo/openwebtext_trainer.py b/sample_workloads/lit-gpt-demo/openwebtext_trainer.py index 2af4349c..7e9631d1 100644 --- a/sample_workloads/lit-gpt-demo/openwebtext_trainer.py +++ b/sample_workloads/lit-gpt-demo/openwebtext_trainer.py @@ -25,7 +25,7 @@ model_name = os.getenv("MODEL_NAME", "Llama-2-70b-hf") name = "openwebtext" -out_dir = Path("out") / name +out_dir = Path(os.getenv("EXPERIMENT_LOCAL_DIR", "")) / "out" data_dir = Path("/data") save_interval = 1000 eval_interval = 1000 @@ -123,7 +123,7 @@ def main(devices: int = 1, 
         precision: Optional[str] = None,
         tpu: bool = False)
-
     else:
         strategy = "auto"
-    logger = step_csv_logger("out", name, cls=CSVLogger, flush_logs_every_n_steps=log_interval)
+    logger = step_csv_logger(out_dir, name, cls=CSVLogger, flush_logs_every_n_steps=log_interval)
     speed_monitor = SpeedMonitorCallback(
         length_fn=lambda batch: batch[0].size(1), batch_size=micro_batch_size, window_size=50, time_unit="seconds"
     )
diff --git a/sample_workloads/lit-gpt-demo/scripts/litgpt_container_entrypoint.sh b/sample_workloads/lit-gpt-demo/scripts/litgpt_container_entrypoint.sh
index 98c381a6..d0ff0103 100644
--- a/sample_workloads/lit-gpt-demo/scripts/litgpt_container_entrypoint.sh
+++ b/sample_workloads/lit-gpt-demo/scripts/litgpt_container_entrypoint.sh
@@ -5,23 +5,25 @@ set -o pipefail

: "${MASTER_ADDR:?Must set MASTER_ADDR}"
: "${NODE_RANK:?Must set NODE_RANK}"
-: "${GCS_BUCKET:?Must set GCS_BUCKET}"
: "${JOB_TIMESTAMP:?Must set JOB_TIMESTAMP}"
-: "${EXPERIMENT_ROOT_DIR:?Must set EXPERIMENT_ROOT_DIR}"
: "${NNODES:?Must set NNODES}"
+: "${GCS_EXPERIMENT_BUCKET:?Must set GCS_EXPERIMENT_BUCKET}"
+: "${EXPERIMENT_ROOT_DIR:?Must set EXPERIMENT_ROOT_DIR}"
+: "${GCS_DATA_BUCKET:?Must set GCS_DATA_BUCKET}"
: "${DATA_DIR:?Must set DATA_DIR}"

-EXPERIMENT_LOCAL_DIR=/experiment/${EXPERIMENT_ROOT_DIR}
+export EXPERIMENT_LOCAL_DIR=/experiment/${EXPERIMENT_ROOT_DIR}
+
mkdir -p $EXPERIMENT_LOCAL_DIR

echo $EXPERIMENT_ROOT_DIR
echo $EXPERIMENT_LOCAL_DIR

-gsutil rsync -r gs://${GCS_BUCKET}/${EXPERIMENT_ROOT_DIR}/ ${EXPERIMENT_LOCAL_DIR}/
+gsutil rsync -r gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/ ${EXPERIMENT_LOCAL_DIR}/

LOCAL_DATA_DIR=/data
mkdir -p $LOCAL_DATA_DIR

-gsutil -m rsync gs://${GCS_BUCKET}/${DATA_DIR} /data
+gsutil -m rsync gs://${GCS_DATA_BUCKET}/${DATA_DIR} /data

export MASTER_PORT=6002
export GPUS_PER_NODE=8
@@ -117,8 +119,8 @@ function on_script_completion {
   # semaphore to cleanly exit hardware utilization monitor
   touch /tmp/workload_terminated

-  echo "Uploading ${EXPERIMENT_LOCAL_DIR} to gs://${GCS_BUCKET}/${EXPERIMENT_ROOT_DIR}/"
-  gsutil rsync -r ${EXPERIMENT_LOCAL_DIR}/ gs://${GCS_BUCKET}/${EXPERIMENT_ROOT_DIR}/
+  echo "Uploading ${EXPERIMENT_LOCAL_DIR} to gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/"
+  gsutil rsync -r ${EXPERIMENT_LOCAL_DIR}/ gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/
}
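With the renamed bucket values wired through the chart, scripts, and trainer, a launch looks roughly like this (a sketch; `litgpt-demo` is a placeholder release name, and `helm/values.yaml` must already be filled in as described in the README):

```
# Install the chart; helm picks up helm/values.yaml by default.
helm install litgpt-demo ./helm

# Watch the pods come up; pod 0 logs details the other pods do not.
kubectl get pods -w
```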