From 886b6bc911ee24f6e0987aa0cab8571005fea422 Mon Sep 17 00:00:00 2001 From: Hira <235995+nhira@users.noreply.github.com> Date: Sat, 24 Aug 2024 11:06:37 -0500 Subject: [PATCH] =?UTF-8?q?Added=20advanced=20usage=20example=20for=20a=20?= =?UTF-8?q?notebook=20interacting=20with=20a=20Cloud=20=E2=80=A6=20(#178)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Added advanced usage example for a notebook interacting with a Cloud TPU cluster * Added advanced usage example for a notebook interacting with a Cloud TPU cluster --- README.md | 15 ++- xpk-notebooks.md | 340 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 349 insertions(+), 6 deletions(-) create mode 100644 xpk-notebooks.md diff --git a/README.md b/README.md index d0def00..1ab62bb 100644 --- a/README.md +++ b/README.md @@ -497,7 +497,7 @@ Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how t --cluster xpk-test --filter-by-job=$USER ``` -* Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once. +* Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once. (Note: `restart-on-user-code-failure` must be set when creating the workload otherwise the workload will always finish with `Completed` status.) @@ -516,11 +516,11 @@ when creating the workload otherwise the workload will always finish with `Compl --timeout=300 ``` - Return codes - `0`: Workload finished and completed successfully. - `124`: Timeout was reached before workload finished. - `125`: Workload finished but did not complete successfully. - `1`: Other failure. + Return codes + `0`: Workload finished and completed successfully. + `124`: Timeout was reached before workload finished. + `125`: Workload finished but did not complete successfully. + `1`: Other failure. ## Inspector * Inspector provides debug info to understand cluster health, and why workloads are not running. @@ -1078,3 +1078,6 @@ To explore the stack traces collected in a temporary directory in Kubernetes Pod --workload xpk-test-workload --command "python3 main.py" --cluster \ xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar ``` + +# Other advanced usage +[Use a Jupyter notebook to interact with a Cloud TPU cluster](xpk-notebooks.md) diff --git a/xpk-notebooks.md b/xpk-notebooks.md new file mode 100644 index 0000000..6865333 --- /dev/null +++ b/xpk-notebooks.md @@ -0,0 +1,340 @@ + + +# Advanced usage - Use a Jupyter notebook to interact with a Cloud TPU cluster + +[Return to README](README.md#other-advanced-usage) + +## Introduction +One of the challenges researchers face when working with contemporary models is the distributed programming involved to orchestrate work with a complex architecture. This example shows you how to use XPK to create a Cloud TPU v5e-256 cluster and interact with it using a Jupyter notebook. + +## Assumptions +You need to ensure you have the TPU capacity (quotas and limits) for this activity. You may need to change machine names and shapes to make this work. 
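This walkthrough also assumes the GKE and Filestore APIs (`container.googleapis.com` and `file.googleapis.com`) are enabled in your project. If you are not sure, a quick check from Cloud Shell looks like this:
```shell
# confirm the GKE and Filestore APIs are enabled
gcloud services list --enabled | grep -E "container.googleapis.com|file.googleapis.com"

# enable them if either one is missing
# gcloud services enable container.googleapis.com file.googleapis.com
```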
To interact with the cluster, we use IPython Parallel and some [cell magic](https://ipyparallel.readthedocs.io/en/latest/tutorial/magics.html). IPython Parallel (ipyparallel) is a Python package and a collection of CLI scripts for controlling clusters of IPython processes, built on the Jupyter protocol. While the default settings are adequate for this example, you should review the [ipyparallel security details](https://ipyparallel.readthedocs.io/en/latest/reference/security.html) before using it in a production environment.

We do most of this work from a Cloud Shell instance. We use a few environment variables to keep the commands short.
```shell
export PROJECTID=${GOOGLE_CLOUD_PROJECT}
export CLUSTER= # your cluster name
export REGION= # region for cluster
export ZONE= # zone for cluster
```

## Cluster creation
### Optional: high-MTU network
If you plan to work with multiple TPU slices, it is useful to create a high-MTU network as shown here (the remaining steps assume you do):
https://github.com/google/maxtext/tree/main/MaxText/configs#create-a-custom-mtu-network
```shell
gcloud compute networks create mtu9k --mtu=8896 \
--project=${PROJECTID} --subnet-mode=auto \
--bgp-routing-mode=regional

gcloud compute firewall-rules create mtu9kfw --network mtu9k \
--allow tcp,icmp,udp --project=${PROJECTID}
```

### XPK create cluster
Install XPK. (You know, this repo!)

Create a GKE Cloud TPU cluster using XPK.
```shell
xpk cluster create --cluster ${CLUSTER} \
--project=${PROJECTID} --default-pool-cpu-machine-type=n2-standard-8 \
--num-slices=1 --tpu-type=v5litepod-256 --zone=${ZONE} \
--spot --custom-cluster-arguments="--network=mtu9k --subnetwork=mtu9k"

# if you need to delete this cluster to fix errors
xpk cluster delete --cluster ${CLUSTER} --zone=${ZONE}
```

## Add storage
Enable the Filestore CSI driver so we can use an NFS Filestore instance for shared storage. (This may take 20-30 minutes.)
```shell
gcloud container clusters update ${CLUSTER} \
--region ${REGION} --project ${PROJECTID} \
--update-addons=GcpFilestoreCsiDriver=ENABLED
```

### Filestore instance
Create a regional NFS [Filestore instance](https://cloud.google.com/filestore/docs/creating-instances#google-cloud-console) in ``${REGION}`` on the network created above.

Note the instance ID and file share name you've used. You will need to wait until this instance is available to continue.

### Persistent volumes
Once the Filestore instance is up, create a file with the correct names and storage size so you can create a persistent volume for the cluster. Update the volumeHandle (``modeInstance/<location>/<instance-id>/<file-share-name>``) and the volumeAttributes below to match your instance, and change the resource names to match your naming.
```yaml
# persistent-volume.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: opmvol
spec:
  storageClassName: ""
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  volumeMode: Filesystem
  csi:
    driver: filestore.csi.storage.gke.io
    volumeHandle: "modeInstance/${ZONE}/nfs-opm-ase/nfs_opm_ase"
    volumeAttributes:
      ip: 10.243.23.194 # your Filestore instance IP address
      volume: nfs_opm_ase # your file share name
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: opmvol-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: opmvol
  resources:
    requests:
      storage: 1Ti
```

Apply the change. Be sure to get the cluster credentials first if you haven't already done so.
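If you still need the instance's IP address for the ``volumeAttributes`` above, you can read it from the instance description first. This is a sketch; it assumes the example instance ID ``nfs-opm-ase``, and the location flag may differ across gcloud versions:
```shell
# look up the Filestore instance IP (see networks[].ipAddresses in the output)
gcloud filestore instances describe nfs-opm-ase \
  --location=${REGION} --project=${PROJECTID}
```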
+```shell +# get cluster credentials if needed +# gcloud container clusters get-credentials ${CLUSTER} --region ${REGION} --project ${PROJECTID} +# kubectl get nodes + +# add the storage to the cluster +kubectl apply -f persistent-volume.yaml +``` + +If it worked, you should see the volume listed. +```shell +kubectl get pv +kubectl get pvc +``` + +## Build Docker image for IPP nodes +We will start with the MaxText image because we want to train an LLM. +```shell +# get the code +git clone "https://github.com/google/maxtext" +``` + +We’ll start with a JAX stable image for TPUs and then update the build specification to include ipyparallel. Edit the ``requirements_with_jax_stable_stack.txt`` to add this at the bottom. +```shell +# also include IPyParallel +ipyparallel +``` + +Build the image and upload it so we can use the image to spin up pods. Note the resulting image name. It should be something like ``gcr.io/${PROJECTID}/opm_ipp_runner/tpu``. +```shell +# use docker build to build the image and upload it +# NOTE: you may need to change the upload repository +bash ./docker_maxtext_jax_stable_stack_image_upload.sh PROJECT_ID=${PROJECTID} \ + BASEIMAGE=us-docker.pkg.dev/${PROJECTID}/jax-stable-stack/tpu:jax0.4.30-rev1 \ + CLOUD_IMAGE_NAME=opm_ipp_runner IMAGE_TAG=latest \ + MAXTEXT_REQUIREMENTS_FILE=requirements_with_jax_stable_stack.txt + +# confirm the image is available +# docker image list gcr.io/${PROJECTID}/opm_ipp_runner/tpu:latest +``` + +## Set up LWS +We use the LeaderWorkerSet for these IPP pods, so they are managed collectively. +```shell +kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/download/v0.3.0/manifests.yaml +``` + +## Set up IPP deployment +Next we set up an LWS pod specification for our IPP instances. Create an ``ipp-deployment.yaml`` file. +You will need to update the volume mounts and the container image references. (You should also change the password.) 
+```yaml +# ipp-deployment.yaml +apiVersion: leaderworkerset.x-k8s.io/v1 +kind: LeaderWorkerSet +metadata: + name: ipp-deployment + annotations: + leaderworkerset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool +spec: + replicas: 1 + leaderWorkerTemplate: + size: 65 + restartPolicy: RecreateGroupOnPodRestart + leaderTemplate: + metadata: + labels: + app: ipp-controller + spec: + securityContext: + runAsUser: 1000 + runAsGroup: 100 + fsGroup: 100 + nodeSelector: + cloud.google.com/gke-tpu-topology: 16x16 + cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice + tolerations: + - key: "google.com/tpu" + operator: "Exists" + effect: "NoSchedule" + containers: + - name: jupyter-notebook-server + image: jupyter/base-notebook:latest + args: ["start-notebook.sh", "--NotebookApp.allow_origin='https://colab.research.google.com'", "--NotebookApp.port_retries=0"] + resources: + limits: + cpu: 1000m + memory: 1Gi + requests: + cpu: 100m + memory: 500Mi + ports: + - containerPort: 8888 + name: http-web-svc + volumeMounts: + - name: opmvol + mountPath: /home/jovyan/nfs # jovyan is the default user + - name: ipp-controller + image: gcr.io/${PROJECTID}/opm_ipp_runner/tpu + command: + - bash + - -c + - | + ip=$(hostname -I | awk '{print $1}') + echo $ip + ipcontroller --ip="$ip" --profile-dir=/app/ipp --log-level=ERROR --ping 10000 + volumeMounts: + - name: opmvol + mountPath: /app/ipp + volumes: + - name: opmvol + persistentVolumeClaim: + claimName: opmvol-claim + + workerTemplate: + spec: + nodeSelector: + cloud.google.com/gke-tpu-topology: 16x16 + cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice + containers: + + - name: ipp-engine + image: gcr.io/${PROJECTID}/opm_ipp_runner/tpu + ports: + - containerPort: 8471 # Default port using which TPU VMs communicate + securityContext: + privileged: true + command: + - bash + - -c + - | + sleep 20 + ipengine --file="/app/ipp/security/ipcontroller-engine.json" --timeout 5.0 + resources: + requests: + google.com/tpu: 4 + limits: + google.com/tpu: 4 + volumeMounts: + - name: opmvol + mountPath: /app/ipp + volumes: + - name: opmvol + persistentVolumeClaim: + claimName: opmvol-claim +``` + +Add the resource to the GKE cluster. +```shell +kubectl apply -f ipp-deployment.yaml + +# to view pod status as they come up +# kubectl get pods +``` +Add a service to expose it. + +Create ``ipp-service.yaml`` +```yaml +# ipp-service.yaml +apiVersion: v1 +kind: Service +metadata: + name: ipp +spec: + selector: + app: ipp-controller + ports: + - protocol: TCP + port: 8888 + targetPort: 8888 + type: ClusterIP #LoadBalancer +``` + +Deploy the new service. +```shell +kubectl apply -f ipp-service.yaml +``` + +If the pods don’t come up as a multihost cluster, you may need to correct the number of hosts depending on the number of chips (e.g., a v5e-256 should have an LWS size of 65 (64 ipp-engines and 1 ipp-controller)). If you need to look at a single container in isolation, you can use something like this. +```shell +# you should NOT have to do this +# kubectl exec ipp-deployment-0-2 -c ipp-engine -- python3 -c "import jax; jax.device_count()" +``` + +To correct errors, you can re-apply an updated template and re-create the leader pod. 
+```shell +# to fetch an updated docker image without changing anything else +# kubectl delete pod ipp-deployment-0 + +# to update the resource definition (automatically re-creates pods) +# kubectl apply -f ipp-deployment.yaml + +# to update the resource definition after an immutable change, you will likely need to use Console +# (i.e., delete Workloads lws-controller-manager, ipp, and ipp-deployment) +# and then you'll also need to delete the resource +# kubectl delete leaderworkerset/ipp-deployment +# kubectl delete service/ipp +``` + +## Optional: optimize networking +If you did create a high-MTU network, you should use the MaxText [preflight.sh](https://github.com/google/maxtext/blob/main/preflight.sh) script (which invokes another script) to tune the network settings for the pods before using them with the notebook (the MaxText reference training scripts automatically do this). +```shell +for pod in $(kubectl get pods --no-headers --output jsonpath="{range.items[*]}{..metadata.name}{'\n'}{end}" | grep ipp-deployment-0-); \ +do \ + echo "${pod}"; + kubectl exec ${pod} -c ipp-engine -- bash ./preflight.sh; +done +``` + +## Use the notebook +Get the link to the notebook … +```shell +kubectl logs ipp-deployment-0 --container jupyter-notebook-server + +# see the line that shows something like this +#http://127.0.0.1:8888/lab?token=1c9012cd239e13b2123028ae26436d2580a7d4fc1d561125 +``` + +Setup local port forwarding to your service so requests from your browser are ultimately routed to your Jupyter service. +```shell +# you will need to do this locally (e.g., laptop), so you probably need to +# gcloud container clusters get-credentials ${CLUSTER} --region ${REGION} --project ${PROJECTID} +kubectl port-forward service/ipp 8888:8888 + +# Example notebook +# https://gist.github.com/nhira/ea4b93738aadb1111b2ee5868d56a22b +```