Small fixes to Lit-GPT demo (#334)
gkroiz authored Nov 17, 2023
2 parents b6dc51a + ef46259 commit 6c5e3fe
Showing 6 changed files with 68 additions and 52 deletions.
85 changes: 46 additions & 39 deletions sample_workloads/lit-gpt-demo/README.md
## Overview

This document provides instructions for running a sample PyTorch-based workload on A3 using TCPx, including the limitations of general PyTorch integration.


## Pre-Requisites

This guide assumes that you have already created a GKE cluster by following this repo, with the proper GPU drivers and host images for TCPx.


## Limitations


### TCPx Limitations with PyTorch Versions

TCPx currently supports a specific NCCL version, which limits the supported versions of PyTorch. The released TCPx binary officially supports NCCL version `2.18.1`, plus an unreleased version `2.18.5unpack_memsyncapifix` based on [this commit](https://github.com/NVIDIA/nccl/commit/321549b7d5e6039a86c0431d0c85e996f9f5fe12). This NCCL version will be installed on the host VM by the nccl-installer daemonset (v3.1.6_2023_10_06). \

If you are comfortable with using the unofficial nccl version then you can use a
Some testing has also been done on 2.17.1 (image versions [23.04-py3](http://nvcr.io/nvidia/pytorch:23.04-py3), [23.03-py3](http://nvcr.io/nvidia/pytorch:23.03-py3)) and it is functional, but not considered officially supported.


## LitGPT Sample Workload

If you are building LitGPT from source, we recommend running these commands in a shell with plenty of memory available (please read our [troubleshooting section](https://docs.google.com/document/d/14x-Lim29ZdcpudJalBn12sQVy9oqutlHr-QOJnvRwxA/edit?resourcekey=0-xfgzT7fofhl9K5qCwSRdLg#heading=h.1zt9nloo6lvr) for more information). If you are consuming the pre-built LitGPT image, these commands can be run in any shell where you can install Docker.


### Environment Setup



```
export CLUSTER_NAME=<name of GKE cluster>
export REGION=<region>
export PROJECT_ID=<project>
```


3. Install `kubectl` and fetch credentials for your GKE cluster.
```
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID
```
4. Install Helm.
```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
sudo chmod +x /usr/local/bin/helm
```
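Before continuing, it can help to confirm that the CLIs used throughout this guide are on your `PATH`. This is a convenience check, not part of the original setup steps:

```shell
# Report which of the required command-line tools are installed.
for tool in gcloud kubectl helm docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```

Any tool reported as `MISSING` should be installed before moving on.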

### Set up Lit-GPT


### Use Pre-built Docker Image

A pre-built example for quickly running LitGPT is available as a sample workload in the [ai-infra-cluster-provisioning](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/tree/develop/sample_workloads/lit-gpt-demo) repo. See [Run LitGPT](#run-lit-gpt) for the next set of instructions.

Several additional parameters are available in the helm values.yaml file when us



### Build Custom Docker Image

If you would rather modify and set up LitGPT on your own, for example if you want to add custom model configs or additional hyperparameter tuning, follow these steps to build the image from source.


#### Docker Image Setup


##### Setup Artifact Registry

Follow [https://cloud.google.com/artifact-registry/docs/repositories/create-repos](https://cloud.google.com/artifact-registry/docs/repositories/create-repos) and make sure to create the repository for Docker images.

Set `$ARTIFACT_REGISTRY` to the registry URL returned.
export ARTIFACT_REGISTRY=<artifact_registry>
```

**Note:** `ARTIFACT_REGISTRY` is generally in the format of `{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{REGISTRY_NAME}`.
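As a quick illustration of that format, the registry URL can be assembled from its components. The values below are placeholders, not real project settings:

```shell
# Assemble the Artifact Registry URL from LOCATION, PROJECT_ID, and REGISTRY_NAME.
LOCATION=us-central1
PROJECT_ID=my-project
REGISTRY_NAME=my-registry
export ARTIFACT_REGISTRY="${LOCATION}-docker.pkg.dev/${PROJECT_ID}/${REGISTRY_NAME}"
echo "$ARTIFACT_REGISTRY"   # us-central1-docker.pkg.dev/my-project/my-registry
```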


### Setup Docker

We need to install Docker since we plan to create our own Docker images. Please refer to [https://docs.docker.com/engine/install/](https://docs.docker.com/engine/install/) for an installation guide. Once Docker is installed, we need to set it up with gcloud. Please run the following (or follow [https://cloud.google.com/artifact-registry/docs/docker/authentication](https://cloud.google.com/artifact-registry/docs/docker/authentication)):

```
gcloud auth configure-docker $LOCATION-docker.pkg.dev
```


### Setup Docker files and Scripts

Please clone the [ai-infra-cluster-provisioning](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning) repo.

```
sudo -E bash build_and_push_litgpt.sh
```
Once the command completes, a new **image tag** will be output in the console. Please keep a record of it; it will be used in the [Helm Config File Setup](#helm-config-file-setup).


### Setting up data

This Lit-GPT training example uses the openwebtext dataset, which can be prepared by following [https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/pretrain_openwebtext.md](https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/pretrain_openwebtext.md). Please upload the resulting files to a Google Cloud Storage (GCS) bucket ([https://cloud.google.com/storage/docs/creating-buckets](https://cloud.google.com/storage/docs/creating-buckets)).

Alternatively, you can find pre-copied versions of this data at:
**Note:** If you use the `litgpt-public-bucket` to load the dataset then you will not be able to upload your training run data to a GCS bucket. If you want GCS logs for your training run then copy those blobs to a bucket that you have write permissions to.


### Setup for distributed training

In the definition of the `Trainer` object ([https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L123-L135](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L123-L135)), we need to add another argument: `num_nodes`. This should match the `nNodes` value in `helm/values.yaml`.

**Note:** Both of the requested code changes above are already present in [https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/blob/develop/sample_workloads/lit-gpt-demo/openwebtext_trainer.py](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/blob/litgptparams/sample_workloads/lit-gpt-demo/openwebtext_trainer.py).


### Additional Changes to Lit-GPT code


#### Add New Model Configurations

To change the model configuration, please edit [https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24). You can set this `model_name` to any name in [https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/config.py](https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/config.py). We also recommend adding your own configurations.

transformers = [



#### Hyperparameter changes

Please take a look at [https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24-L46](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/openwebtext_trainer.py#L24-L46).

If you want to customize hyperparameters or parts of the model, please either (1) make the adjustments in the lit-gpt source code or (2) add some flags to adjust in the command line. Look at `litgpt_container_entrypoint.sh` for where exactly the training script is being called.
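For intuition on how `batchSize` and `microBatchSize` interact, here is a sketch assuming lit-gpt's usual gradient-accumulation scheme (the global batch is split into micro-batches; the ratio below is our reading of the trainer, not stated in this README):

```shell
# With gradient accumulation, each optimizer step processes BATCH_SIZE samples
# in BATCH_SIZE / MICRO_BATCH_SIZE forward/backward passes.
BATCH_SIZE=6
MICRO_BATCH_SIZE=2
echo "accumulation steps per optimizer step: $(( BATCH_SIZE / MICRO_BATCH_SIZE ))"
# prints: accumulation steps per optimizer step: 3
```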


### Run Lit-GPT

**Note:** If there are any changes to `lit-gpt`, please build and push a new Docker image.


### Helm Config File Setup

Next, create a copy of `helm/values.yaml.example` without the `.example` ending.

```
cp helm/values.yaml.example helm/values.yaml
```

```
network:
rxdmContainer: "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd:v2.0.7"
disablePmtu: "yes"
workload:
jobTimestamp: <int: add a timestamp here or unique identifier>
gcsExperimentBucket: <str: your gcs bucket where experiment logs should go>
experimentDir: <str: root gcs directory of experiment>
gcsDataBucket: <str: your gcs bucket where data is located>
  dataDir: <str: directory within the gcs bucket where train.bin and val.bin are located>
image: us-central1-docker.pkg.dev/<YOUR PROJECT ID>/<ARTIFACT REGISTRY NAME>/litgpt-full:<ADD TAG HERE>
configDirInBucket: null
batchSize: 6
microBatchSize: 6
modelName: Llama-2-70b-hf
```

In the helm config file `values.yaml`, you need to make changes to the following flags based on the workload requirements:

* `nodePool`
* `nNodes`
* `jobTimestamp`
* `gcsExperimentBucket`
* `experimentDir`
* `gcsDataBucket`
* `dataDir`
* `image`

`nodePool` refers to the name of the GKE NodePool where the LitGPT job will be run.

`nNodes` refers to the number of GKE GPU nodes in the GKE NodePool (specified by `nodePool`) for running the LitGPT job. Note that the value of `nNodes` cannot exceed the total number of GKE GPU nodes in that NodePool.

`jobTimestamp` needs to be a timestamp or a unique identifier.
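One simple way to generate such an identifier (a suggestion, not mandated by the demo) is the current Unix time:

```shell
# Use the current Unix epoch seconds as a unique-enough job identifier.
JOB_TIMESTAMP=$(date +%s)
echo "$JOB_TIMESTAMP"
```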

`gcsExperimentBucket` refers to the GCS bucket where experiment logs and output are stored.

`experimentDir` refers to a directory in the GCS bucket (specified by `gcsExperimentBucket`) for logging. For example, `pir-pythia-6.9b/training_logs/` is a directory already set up for logging in the shared GCS bucket `litgpt-public-bucket`. \
Alternatively, you can create your own directory (e.g. named `logDir`) under your own GCS bucket (e.g. named `myBucket`). Then the logs will be saved at the target location (e.g. `gs://myBucket`/`logDir`) designated by the parameter `experimentDir`.

`gcsDataBucket` refers to the GCS bucket where your data is stored. For example, the shared bucket `litgpt-public-bucket` contains pre-copied versions of the openwebtext dataset. Alternatively, you can use your own GCS bucket (e.g. named `myBucket`).

`dataDir` refers to a directory in the GCS bucket (specified by `gcsDataBucket`) for training data. For example, `openwebtext_dataset` is a directory containing the training data in the shared GCS bucket `litgpt-public-bucket`. \
Alternatively, you can create your own directory (e.g. named `dataSourceDir`) under your own GCS bucket (e.g. named `myBucket`). Then the training data will be loaded from the source location (e.g. `gs://myBucket/dataSourceDir/train.bin` and `gs://myBucket/dataSourceDir/val.bin`) designated by the parameter `dataDir`.
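Putting those two values together, the training data is expected at paths of the following shape (`myBucket` and `dataSourceDir` are the hypothetical names from the text above):

```shell
# Compose the GCS object paths the workload will read train/val data from.
GCS_DATA_BUCKET=myBucket
DATA_DIR=dataSourceDir
echo "gs://${GCS_DATA_BUCKET}/${DATA_DIR}/train.bin"
echo "gs://${GCS_DATA_BUCKET}/${DATA_DIR}/val.bin"
```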

`image `refers to the Docker image set up for LitGPT. The value of this flag is in the following format: `<repository_location>-docker.pkg.dev/<project>/<repository_name>/litgpt-full:<tag>`
You can check the status of your workload via any of the following:
**Note:** pod0 contains logs that other pods in the same experiment do not contain.


### MFU Calculation

MFU can be calculated from the `metrics.csv` file output by LitGPT. During a training run this file can be found in the litgpt container at `/workspace/out/openwebtext/version_0/metrics.csv`. After training completes, this file will be uploaded as part of the `experimentDir` specified in the helm values.

Step times are presented in the csv as aggregate times in the column `time/train`, so the value used should be
`time/train[n] - time/train[n-1]`.
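As a sketch of that calculation (using a tiny made-up `time/train` column rather than a real `metrics.csv`):

```shell
# Recover per-step times from the aggregate time/train column:
# step_time[n] = time/train[n] - time/train[n-1]
printf 'time/train\n10.0\n21.5\n33.5\n' |
awk 'NR > 2 { printf "%.1f\n", $1 - prev } NR > 1 { prev = $1 }'
# prints 11.5 then 12.0
```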


MFU for this sample workload can be calculated using the formula:

```
estimated_flops = ops_per_step * trainable_flops * batch_size
```

For example, running Llama2-70b on 40 VMs would have you calculate this as:
The MFU value is also available in the `metrics.csv` file after 50 iterations at column `throughput/device/mfu`, though we have seen inconsistent numbers reported and recommend calculating it manually.


## Troubleshooting

**Docker build issues**

3 changes: 3 additions & 0 deletions sample_workloads/lit-gpt-demo/build_and_push_litgpt.sh
```
FULL_IMAGE=${FULL_IMAGE:="$ARTIFACT_REGISTRY/litgpt-full"}

# Clone LitGPT and check out a flash-attn enabled commit
if [ ! -d $LITGPT_PATH ]; then
  git clone https://github.com/Lightning-AI/lit-gpt.git
  cd lit-gpt
  git checkout d5d371417ecb3d3b6c4f30837d8bb7cf2b5310ae
  cd ..
  LITGPT_PATH=lit-gpt
fi
```
9 changes: 6 additions & 3 deletions sample_workloads/lit-gpt-demo/helm/templates/litgpt.yaml
```yaml
{{- $requiredVar := .Values.cluster.nNodes | required ".Values.cluster.nNodes is required" -}}
{{- $requiredVar := .Values.cluster.nodePool | required ".Values.cluster.nodePool is required" -}}
{{- $requiredVar := .Values.network.ncclIfnames | required ".Values.ncclIfnames is required" -}}
{{- $requiredVar := .Values.workload.jobTimestamp | required ".Values.jobTimestamp is required" -}}
{{- $requiredVar := .Values.workload.gcsExperimentBucket | required ".Values.gcsExperimentBucket is required" -}}
{{- $requiredVar := .Values.workload.experimentDir | required ".Values.experimentDir is required" -}}
{{- $requiredVar := .Values.workload.gcsDataBucket | required ".Values.gcsDataBucket is required" -}}
{{- $requiredVar := .Values.workload.dataDir | required ".Values.dataDir is required" -}}
{{- $requiredVar := .Values.workload.image | required ".Values.image is required" -}}
apiVersion: v1
```

```yaml
# env entries on the workload container:
- name: DISABLE_PMTU
  value: "{{$root.Values.network.disablePmtu}}"
- name: CPU_PINNING_MODE
  value: "{{$root.Values.network.cpuPinningMode}}"
- name: GCS_EXPERIMENT_BUCKET
  value: "{{$root.Values.workload.gcsExperimentBucket}}"
- name: EXPERIMENT_ROOT_DIR
  value: "{{$root.Values.workload.experimentDir}}"
- name: GCS_DATA_BUCKET
  value: "{{$root.Values.workload.gcsDataBucket}}"
- name: DATA_DIR
  value: "{{$root.Values.workload.dataDir}}"
- name: BATCH_SIZE
```
3 changes: 2 additions & 1 deletion sample_workloads/lit-gpt-demo/helm/values.yaml.example
```yaml
network:

workload:
  jobTimestamp: <int: add a timestamp here or unique identifier>
  gcsExperimentBucket: <str: your gcs bucket where experiment logs should go>
  experimentDir: <str: root gcs directory of experiment>
  gcsDataBucket: <str: your gcs bucket where data is located>
  dataDir: <str: directory within the gcs bucket where train.bin and val.bin are located>
  image: us-central1-docker.pkg.dev/<YOUR PROJECT ID>/<ARTIFACT REGISTRY NAME>/litgpt-full:<ADD TAG HERE>
  configDirInBucket: null
```
4 changes: 2 additions & 2 deletions sample_workloads/lit-gpt-demo/openwebtext_trainer.py
```python
model_name = os.getenv("MODEL_NAME", "Llama-2-70b-hf")
name = "openwebtext"
out_dir = Path(os.getenv("EXPERIMENT_LOCAL_DIR", "")) / "out"
data_dir = Path("/data")
save_interval = 1000
eval_interval = 1000
```

```python
# inside main(devices: int = 1, precision: Optional[str] = None, tpu: bool = False):
    else:
        strategy = "auto"

    logger = step_csv_logger(out_dir, name, cls=CSVLogger, flush_logs_every_n_steps=log_interval)
    speed_monitor = SpeedMonitorCallback(
        length_fn=lambda batch: batch[0].size(1), batch_size=micro_batch_size, window_size=50, time_unit="seconds"
    )
```
sample_workloads/lit-gpt-demo/litgpt_container_entrypoint.sh

```
set -o pipefail

: "${MASTER_ADDR:?Must set MASTER_ADDR}"
: "${NODE_RANK:?Must set NODE_RANK}"
: "${JOB_TIMESTAMP:?Must set JOB_TIMESTAMP}"
: "${NNODES:?Must set NNODES}"
: "${GCS_EXPERIMENT_BUCKET:?Must set GCS_EXPERIMENT_BUCKET}"
: "${EXPERIMENT_ROOT_DIR:?Must set EXPERIMENT_ROOT_DIR}"
: "${GCS_DATA_BUCKET:?Must set GCS_DATA_BUCKET}"
: "${DATA_DIR:?Must set DATA_DIR}"

export EXPERIMENT_LOCAL_DIR=/experiment/${EXPERIMENT_ROOT_DIR}

mkdir -p $EXPERIMENT_LOCAL_DIR

echo $EXPERIMENT_ROOT_DIR
echo $EXPERIMENT_LOCAL_DIR

gsutil rsync -r gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/ ${EXPERIMENT_LOCAL_DIR}/

LOCAL_DATA_DIR=/data
mkdir -p $LOCAL_DATA_DIR
gsutil -m rsync gs://${GCS_DATA_BUCKET}/${DATA_DIR} /data

export MASTER_PORT=6002
export GPUS_PER_NODE=8
```

```
function on_script_completion {
  # semaphore to cleanly exit hardware utilization monitor
  touch /tmp/workload_terminated

  echo "Uploading ${EXPERIMENT_LOCAL_DIR} to gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/"
  gsutil rsync -r ${EXPERIMENT_LOCAL_DIR}/ gs://${GCS_EXPERIMENT_BUCKET}/${EXPERIMENT_ROOT_DIR}/
}
```
