Installing and Testing OpenShift fms-hf-tuning Stack

Note: the steps below were written by [email protected] based on his experience while working on the project. Since then, official installation instructions have been published here: https://opendatahub.io/docs/installing-open-data-hub/#installing-odh-v2_installv2

with customization for Kueue and instructions for running a tuning job here: https://opendatahub.io/docs/working-with-distributed-workloads/


0. Prerequisites

0.1 An OpenShift cluster up and running. (These steps were tested on OpenShift 4.14.17.)

0.2 Logged in to the OpenShift web console. Note, everything below is installed from the command line, but the web console is needed for the "Copy login command" option, which provides the oc login token.

0.3 Also logged in to the terminal with oc login. For example:

oc login --token=sha256~OgYOYAA0ONu.... --server=https://api.jim414.cp.fyre.ibm.com:6443
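To confirm the login worked before moving on, you can optionally check the current user and cluster:

oc whoami
oc cluster-info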

1. Install ODH with Fast Channel

Using your terminal where you're logged in with oc login, issue this command:

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/opendatahub-operator.openshift-operators: ""
  name: opendatahub-operator
  namespace: openshift-operators
spec:
  channel: fast
  installPlanApproval: Automatic
  name: opendatahub-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: opendatahub-operator.v2.17.0
EOF

You can check it started with:

watch oc get pods -n openshift-operators
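Optionally, you can also confirm the operator's ClusterServiceVersion reaches the Succeeded phase. The CSV name below matches the startingCSV above (it may differ if the fast channel has since moved on):

oc get csv opendatahub-operator.v2.17.0 -n openshift-operators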

2. Install the DSCI prerequisite Operators

2.1 Install service mesh

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/servicemeshoperator.openshift-operators: ""
  name: servicemeshoperator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: servicemeshoperator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  startingCSV: servicemeshoperator.v2.6.1
EOF

And then check it with:

watch oc get pods -n openshift-operators

2.2 Install Authorino Operator

cat << EOF | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/authorino-operator.openshift-operators: ""
  name: authorino-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: authorino-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
  startingCSV: authorino-operator.v0.11.1
EOF

And then check it with:

watch oc get pods -n openshift-operators
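At this point all three Subscriptions should be present and their CSVs should report the Succeeded phase; a quick way to confirm both:

oc get sub,csv -n openshift-operators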

3. Install DSCI

cat << EOF | oc apply -f -
kind: DSCInitialization
apiVersion: dscinitialization.opendatahub.io/v1
metadata:
  name: default-dsci
spec:
  applicationsNamespace: opendatahub
  monitoring:
    managementState: Managed
    namespace: opendatahub
  serviceMesh:
    auth:
      audiences:
      - https://kubernetes.default.svc
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ""
    managementState: Managed
EOF

And then check it (it should reach the "Ready" state after a minute or so):

watch oc get dsci

Also note that you'll see the Istio control plane (data-science-smcp) start up in the namespace set in the DSCI:

oc get pods -n istio-system
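If you want to check the ServiceMeshControlPlane object itself (resource short name smcp), it should report its components as ready after a few minutes:

oc get smcp -n istio-system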

4. Install the DSC

cat << EOF | oc apply -f -
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Removed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    kserve:
      managementState: Removed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    modelregistry:
      managementState: Removed
    ray:
      managementState: Removed
    trainingoperator:
      managementState: Managed
    trustyai:
      managementState: Removed
    workbenches:
      managementState: Removed
EOF

Check that the pods are running:

watch oc get pods -n opendatahub

You should see these pods:

oc get pods -n opendatahub
NAME                                         READY   STATUS    RESTARTS   AGE
kubeflow-training-operator-dc9cf9bb5-595xx   1/1     Running   0          4h50m
kueue-controller-manager-66768ccc94-4xq4v    1/1     Running   0          4h51m
odh-dashboard-5969fd7b5b-gd6rt               2/2     Running   0          4h51m
odh-dashboard-5969fd7b5b-xd7qj               2/2     Running   0          4h51m

Note: if you're having pull issues from docker.io, you can change the deployment to pull from quay.io instead:

oc set image deployment kubeflow-training-operator training-operator=quay.io/jbusche/training-operator:v1-855e096 -n opendatahub

Note: the initContainer pulls from docker.io/alpine:3.10 automatically, which causes trouble on clusters that are rate-limited by docker.io. To get around this, you can run the following command to patch the training operator so it uses a different repository for the initContainer image:

oc patch deployment kubeflow-training-operator -n opendatahub --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/command", "value": ["/manager",  "--pytorch-init-container-image=quay.io/jbusche/alpine:3.10"]}]'
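After either image change, you can wait for the deployment to finish rolling out with the new image:

oc rollout status deployment/kubeflow-training-operator -n opendatahub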

5. Configure your Kueue minimum requirements:

cat <<EOF | kubectl apply -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF
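To confirm the queueing objects were created (ResourceFlavor, ClusterQueue, and WorkloadPriorityClass are cluster-scoped; the LocalQueue lives in the default namespace):

oc get resourceflavors,clusterqueues,workloadpriorityclasses
oc get localqueues -n default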

6. Testing

I've been using Ted's script, changing the image tag depending on the fms-hf-tuning image that we want to use.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
  namespace: default
data:
  config.json: |
    {
      "accelerate_launch_args": {
        "num_machines": 1,
        "num_processes": 2
      },
      "model_name_or_path": "bigscience/bloom-560m",
      "training_data_path": "/etc/config/twitter_complaints_small.json",
      "output_dir": "/tmp/out",
      "num_train_epochs": 1.0,
      "per_device_train_batch_size": 4,
      "per_device_eval_batch_size": 4,
      "gradient_accumulation_steps": 4,
      "eval_strategy": "no",
      "save_strategy": "epoch",
      "learning_rate": 1e-5,
      "weight_decay": 0.0,
      "lr_scheduler_type": "cosine",
      "logging_steps": 1.0,
      "packing": false,
      "include_tokens_per_second": true,
      "response_template": "\n### Label:",
      "dataset_text_field": "output",
      "use_flash_attn": false,
      "torch_dtype": "float32",
      "peft_method": "pt",
      "tokenizer_name_or_path": "bigscience/bloom"
    }
  twitter_complaints_small.json: |
    {"Tweet text":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
    {"Tweet text":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
    {"Tweet text":"If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService","ID":2,"Label":1,"text_label":"complaint","output":"### Text: If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService\n\n### Label: complaint"}
    {"Tweet text":"@EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.","ID":3,"Label":1,"text_label":"complaint","output":"### Text: @EE On Rosneath Arial having good upload and download speeds but terrible latency 200ms. Why is this.\n\n### Label: complaint"}
    {"Tweet text":"Couples wallpaper, so cute. :) #BrothersAtHome","ID":4,"Label":2,"text_label":"no complaint","output":"### Text: Couples wallpaper, so cute. :) #BrothersAtHome\n\n### Label: no complaint"}
    {"Tweet text":"@mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG","ID":5,"Label":2,"text_label":"no complaint","output":"### Text: @mckelldogs This might just be me, but-- eyedrops? Artificial tears are so useful when you're sleep-deprived and sp\u2026 https:\/\/t.co\/WRtNsokblG\n\n### Label: no complaint"}
    {"Tweet text":"@Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?","ID":6,"Label":2,"text_label":"no complaint","output":"### Text: @Yelp can we get the exact calculations for a business rating (for example if its 4 stars but actually 4.2) or do we use a 3rd party site?\n\n### Label: no complaint"}
    {"Tweet text":"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?","ID":7,"Label":1,"text_label":"complaint","output":"### Text: @nationalgridus I have no water and the bill is current and paid. Can you do something about this?\n\n### Label: complaint"}
    {"Tweet text":"Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora","ID":8,"Label":1,"text_label":"complaint","output":"### Text: Never shopping at @MACcosmetics again. Every time I go in there, their employees are super rude\/condescending. I'll take my $$ to @Sephora\n\n### Label: complaint"}
    {"Tweet text":"@JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd","ID":9,"Label":2,"text_label":"no complaint","output":"### Text: @JenniferTilly Merry Christmas to as well. You get more stunning every year \ufffd\ufffd\n\n### Label: no complaint"}
---
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: ted-kfto-sft
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: lq-trainer
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never # Do not restart the pod on failure. If you do set it to OnFailure, be sure to also set backoffLimit
      template:
        spec:
          containers:
            - name: pytorch
              # This is the temporary location until the image is officially released
              #image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:0.0.1rc7
              #image: quay.io/jbusche/fms-hf-tuning:issue758-1
              #image: quay.io/modh/fms-hf-tuning:01b3824c9aba22d9d0695399681e6f0507840e7f
              #image: quay.io/modh/fms-hf-tuning:a130d1c890501a4fac1d9522f1198b6273ade2d4
              image: quay.io/modh/fms-hf-tuning:release
              imagePullPolicy: IfNotPresent
              command:
                - "python"
                - "/app/accelerate_launch.py"
              env:
                - name: SFT_TRAINER_CONFIG_JSON_PATH
                  value: /etc/config/config.json
              volumeMounts:
              - name: config-volume
                mountPath: /etc/config
          volumes:
          - name: config-volume
            configMap:
              name: my-config
              items:
              - key: config.json
                path: config.json
              - key: twitter_complaints_small.json
                path: twitter_complaints_small.json
EOF

And then in a perfect world, it'll start up a pytorchjob and run to completion:

watch oc get pytorchjobs,pods -n default

and it'll look like this:

Every 2.0s: oc get pytorchjobs,pods -n default                                api.ted414.cp.fyre.ibm.com: Wed Apr 24 18:34:49 2024

NAME                                   STATE       AGE
pytorchjob.kubeflow.org/ted-kfto-sft   Succeeded   58m

NAME                        READY   STATUS      RESTARTS   AGE
pod/ted-kfto-sft-master-0   0/1     Completed   0          58m
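While the job is running (or after it completes), you can also check that Kueue admitted the workload and follow the trainer logs. The pod name below comes from the example job above:

oc get workloads -n default
oc logs -f ted-kfto-sft-master-0 -n default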

Manually building and pushing an image to the OpenShift cluster's internal registry

  1. First, you need to be logged into the OpenShift cluster with oc login. For example:
oc login --token=sha256~eNI_S6ah... --server=https://api.jimfips.cp.fyre.ibm.com:6443
  2. Enable the default route to the internal image registry with this step, then wait a few minutes; you'll know it's ready when step 3 below succeeds.
oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
  3. Using podman, log in to the internal OpenShift registry:
podman login -u kubeadmin -p $(oc whoami -t) $(oc registry info) --tls-verify=false
  4. Build a new fms-hf-tuning image from main or from your branch using these steps.

4.1 Download the repo from main:

git clone https://github.com/jbusche/fms-hf-tuning.git
cd fms-hf-tuning

4.2 Alternatively, you could download from your repo and use your PR branch like this:

git clone https://github.com/jbusche/fms-hf-tuning.git -b jb-828-python-cves
cd fms-hf-tuning

4.3 Build the image locally, naming it as you'd like (I used today's date):

docker build --progress=plain -t fms-hf-tuning:jim-0509-fixed . -f build/Dockerfile
  5. Log in with podman (if you haven't already), then tag and push the image to the internal registry:
podman login -u kubeadmin -p $(oc whoami -t) $(oc registry info) --tls-verify=false
podman tag localhost/fms-hf-tuning:jim-0509-fixed  $(oc registry info)/opendatahub/fms-hf-tuning:jim-0509-fixed
podman push  --tls-verify=false $(oc registry info)/opendatahub/fms-hf-tuning:jim-0509-fixed
  6. Run the test step from https://github.com/foundation-model-stack/fms-hf-tuning/wiki/Installing-and-Testing-OpenShift-fms%E2%80%90hf%E2%80%90tuning-Stack#6-testing, only substitute in your image name. For example, change:
image: quay.io/modh/fms-hf-tuning:a130d1c890501a4fac1d9522f1198b6273ade2d4
to
image: image-registry.openshift-image-registry.svc:5000/opendatahub/fms-hf-tuning:jim-0509-fixed
  7. And then, in a perfect world, it'll start up a pytorchjob and run to completion:
watch oc get pytorchjobs,pods -n default

and it'll look like this:

Every 2.0s: oc get pytorchjobs,pods -n default                                api.ted414.cp.fyre.ibm.com: Wed Apr 24 18:34:49 2024

NAME                                   STATE       AGE
pytorchjob.kubeflow.org/ted-kfto-sft   Succeeded   58m

NAME                        READY   STATUS      RESTARTS   AGE
pod/ted-kfto-sft-master-0   0/1     Completed   0          58m
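If the job doesn't start and you suspect the image push, you can confirm the image landed in the internal registry by looking at the ImageStream in the opendatahub namespace (the tag name here is from the example above):

oc get imagestream fms-hf-tuning -n opendatahub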

Cleanup

Cleanup of your pytorchjob and cm:

oc delete pytorchjob ted-kfto-sft  -n default
oc delete cm my-config -n default

Cleanup of your Kueue resources, if you want that:

cat <<EOF | kubectl delete -f -
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "cpu-flavor"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cq-small"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "cpu-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 5
      - name: "memory"
        nominalQuota: 20Gi
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-trainer
  namespace: default
spec:
  clusterQueue: cq-small
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p1
value: 30000
description: "high priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: p2
value: 10000
description: "low priority"
EOF

Cleanup of the DSC (if you want that)

oc delete dsc default-dsc

Cleanup of DSCI (if you want that)

oc delete dsci default-dsci

Cleanup of ODH operators (if you want that)

oc delete sub authorino-operator opendatahub-operator servicemeshoperator -n openshift-operators
oc delete csv authorino-operator.v0.11.1  opendatahub-operator.v2.17.0 servicemeshoperator.v2.6.1 -n openshift-operators