Enable job suspend for Kueue #317

tedhtchang · 2024-10-10T20:31:27Z

Enable job suspension for the Kueue integration. The feature should also work independently of Kueue.
How to test this feature:

ENABLED_SERVICES=LMES CONTROLLER_TOOLS_VERSION=v0.16.3 IMG=quay.io/tedchang/trustyai-service-operator:latest make run

Create a job in the suspended state. Verify that the job is suspended and that its pod is not running.

cat << EOF| kubectl apply -f -
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  labels:
    app.kubernetes.io/name: fms-lm-eval-service
    app.kubernetes.io/managed-by: kustomize
  name: evaljob-sample
  namespace: default
spec:
  suspend: true
  model: hf
  modelArgs:
  - name: pretrained
    value: EleutherAI/pythia-70m
  taskList:
    taskNames:
    - unfair_tos
  logSamples: true
  limit: "5"
EOF

Set suspend to false and verify that the job's pod is created and running.

oc patch lmevaljob evaljob-sample --patch '{"spec":{"suspend":false}}' --type merge

openshift-ci · 2024-10-10T20:31:39Z

Hi @tedhtchang. Thanks for your PR.

I'm waiting for a trustyai-explainability member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

tedhtchang · 2024-10-12T00:10:28Z

/cc @yhwang

yhwang

Great change; I left some minor comments. One thing I noticed is that you updated kube-builder to v0.16.3. I guess you also need to update the Makefile to reflect that.

yhwang · 2024-10-14T16:25:20Z

api/lmes/v1alpha1/lmevaljob_types.go

@@ -236,6 +238,8 @@ type LMEvalJobSpec struct {
 	// Specify extra information for the lm-eval job's pod
 	// +optional
 	Pod *LMEvalPodSpec `json:"pod,omitempty"`
+	// Suspend keeps the job but without pods. This is intended to be used by the Kueue integration
+	Suspend bool `json:"suspend,omitempty"`


//+optional ?
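For context on the question above, here is a minimal sketch of how the new field behaves on the wire. The `jobSpec` type is a hypothetical, trimmed-down stand-in for `LMEvalJobSpec`, not the operator's real type. The `+optional` marker is only a comment consumed by code generators and has no runtime effect, while the `omitempty` JSON tag is what drops the field from serialized output when it is false.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobSpec is a hypothetical stand-in for LMEvalJobSpec with only two fields.
type jobSpec struct {
	Model string `json:"model"`
	// Suspend keeps the job but without pods.
	// +optional
	Suspend bool `json:"suspend,omitempty"`
}

// encode marshals a spec to JSON. Marshalling this struct cannot fail,
// so an error is treated as a programming bug.
func encode(s jobSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// decodeSuspend reads the suspend flag from a JSON document.
// A document without the field yields the zero value, false.
func decodeSuspend(doc string) bool {
	var s jobSpec
	if err := json.Unmarshal([]byte(doc), &s); err != nil {
		panic(err)
	}
	return s.Suspend
}

func main() {
	fmt.Println(encode(jobSpec{Model: "hf"}))                // {"model":"hf"}
	fmt.Println(encode(jobSpec{Model: "hf", Suspend: true})) // {"model":"hf","suspend":true}
	fmt.Println(decodeSuspend(`{"model":"hf"}`))             // false
}
```

Because a missing field decodes to false, existing LMEvalJob resources that never set `suspend` keep running as before.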

yhwang · 2024-10-14T16:41:48Z

controllers/lmes/lmevaljob_controller.go

@@ -181,6 +181,10 @@ func (r *LMEvalJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (
 		job.Status.State = lmesv1alpha1.NewJobState
 	}

+	if job.Spec.Suspend {
+		r.handleSuspend(ctx, log, job)


missing the return
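To illustrate why the missing return matters, the following is a simplified sketch and not the operator's code. The `job` and `result` types and both `reconcile` functions are hypothetical stand-ins. Without the return, the reconcile falls through to the normal path and creates a pod for a job that was just suspended.

```go
package main

import "fmt"

// result is a hypothetical stand-in for ctrl.Result.
type result struct{ requeue bool }

// job is a hypothetical stand-in for LMEvalJob; pods counts its running pods.
type job struct {
	suspend bool
	pods    int
}

// handleSuspend removes the job's pod. It stands in for the real handler.
func handleSuspend(j *job) (result, error) {
	j.pods = 0
	return result{}, nil
}

// reconcileBuggy discards handleSuspend's outcome and falls through,
// so the normal path below recreates the pod of a suspended job.
func reconcileBuggy(j *job) (result, error) {
	if j.suspend {
		handleSuspend(j)
	}
	j.pods = 1 // normal path: make sure the job has a pod
	return result{}, nil
}

// reconcileFixed returns the handler's outcome, ending the reconcile early.
func reconcileFixed(j *job) (result, error) {
	if j.suspend {
		return handleSuspend(j)
	}
	j.pods = 1 // normal path: make sure the job has a pod
	return result{}, nil
}

func main() {
	buggy := &job{suspend: true}
	reconcileBuggy(buggy)
	fixed := &job{suspend: true, pods: 1}
	reconcileFixed(fixed)
	fmt.Println(buggy.pods, fixed.pods) // 1 0
}
```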

yhwang · 2024-10-14T16:44:07Z

controllers/lmes/lmevaljob_controller.go

+		if !job.Spec.Suspend {
+			return r.handleResume(ctx, log, job)
+		}
+		return ctrl.Result{}, nil


you can remove this and fall through to the end of the function, there is a final no-op return there.

yhwang · 2024-10-14T16:47:45Z

controllers/lmes/lmevaljob_controller.go

+		log.Info("Suspend job")
+		if err := r.deleteJobPod(ctx, job); err != nil && client.IgnoreNotFound(err) != nil {
+			log.Error(err, "failed to delete pod for suspended job")
+			return ctrl.Result{Requeue: true, RequeueAfter: r.options.PodCheckingInterval}, err


please update this to return r.pullingJobs.addOrUpdate(string(job.GetUID()), r.options.PodCheckingInterval), nil since the pulling mechanism is merged.

yhwang · 2024-10-14T16:48:07Z

controllers/lmes/lmevaljob_controller.go

+	pod := r.createPod(job, log)
+	if err := r.Create(ctx, pod); err != nil {
+		log.Error(err, "failed to create pod to resume job")
+		return ctrl.Result{Requeue: true, RequeueAfter: r.options.PodCheckingInterval}, err


please update this to return r.pullingJobs.addOrUpdate(string(job.GetUID()), r.options.PodCheckingInterval), nil since the pulling mechanism is merged.

yhwang

/LGTM

github-actions · 2024-10-14T21:40:25Z

PR image build and manifest generation completed successfully!

📦 PR image: quay.io/trustyai/trustyai-service-operator-ci:d050f126cf43eeb6aae070c99dd8ccc6fbfc0713

📦 LMES driver image: quay.io/trustyai/ta-lmes-driver:d050f126cf43eeb6aae070c99dd8ccc6fbfc0713

📦 LMES job image: quay.io/trustyai/ta-lmes-job:d050f126cf43eeb6aae070c99dd8ccc6fbfc0713

🗂️ CI manifests

yhwang · 2024-10-14T22:20:25Z

/cc @ruivieira

ah, you already added the ok-to-test label. Thanks!

tedhtchang · 2024-10-18T02:07:57Z

@yhwang @ruivieira I rebased this PR onto the changes from the refactor PR because it needs some of the refactored functions. I needed to test it out.

ruivieira · 2024-10-19T11:19:24Z

@yhwang @tedhtchang thanks for this, LGTM!

I've merged #323. Let me know if this PR is ready for merge, too.

ruivieira · 2024-10-19T11:20:46Z

/approve

openshift-ci · 2024-10-19T11:20:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ruivieira, yhwang

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tedhtchang · 2024-10-20T18:27:11Z

> @yhwang @tedhtchang thanks for this, LGTM!
>
> I've merged #323. Let me know if this PR is ready for merge, too.

Hi @ruivieira, this should be ready to merge. I didn't see any conflicts.

* Add lm-eval-service controller (#258) * feat: Initial database support (#246) * Initial database support - Add status checking - Add better storage flags - Add spec.storage.format validation - Add DDL -Add HIBERNATE format to DB (test) - Update service image - Revert identifier to DATABASE - Update CR options (remove mandatory data) * Remove default DDL generation env var * Update service image to latest tag * Add migration awareness * Add updating pods for migration * Change JDBC url from mysql to mariadb * Fix TLS mount * Revert images * Remove redundant logic * Fix comments * feat: Add TLS certificate mount on ModelMesh (#255) * feat: Add TLS certificate mount on ModelMesh * Revert from http to https until kserve/modelmesh#147 is merged * Add lm-eval-service controller refactor the existing TrustyAIService controller and add LMEvalService controller Signed-off-by: Yihong Wang <[email protected]> --------- Signed-off-by: Yihong Wang <[email protected]> Co-authored-by: Rui Vieira <[email protected]> * fix: Fix typo in operator's arguments (#261) Operator's arguments changed from `--eanble-services` to `--enable-services`. trustyai.opendatahub.io_lmevaljobs.yaml and zz_generated.deepcopy.go regenerated. 
* feat: Add LMES driver build to GHA (#272) * sync: sync dev/lm-eval with main branch (#271) * feat: Initial database support (#246) * Initial database support - Add status checking - Add better storage flags - Add spec.storage.format validation - Add DDL -Add HIBERNATE format to DB (test) - Update service image - Revert identifier to DATABASE - Update CR options (remove mandatory data) * Remove default DDL generation env var * Update service image to latest tag * Add migration awareness * Add updating pods for migration * Change JDBC url from mysql to mariadb * Fix TLS mount * Revert images * Remove redundant logic * Fix comments * feat: Add TLS certificate mount on ModelMesh (#255) * feat: Add TLS certificate mount on ModelMesh * Revert from http to https until kserve/modelmesh#147 is merged * Pin oc version, ubi version (#263) * Restore checkout of trustyai-exp (#265) * Add operator installation robustness (#266) * fix: Skip InferenceService patching for KServe RawDeployment (#262) * feat: ConfigMap key to disable KServe Serverless configuration (#267) * feat: Add support for custom certificates in database connection (#259) * Add TLS endpoint for ModelMesh payload processors. 
(#268) Keep non-TLS endpoint for KServe Serverless (disabled by default) --------- Signed-off-by: Yihong Wang <[email protected]> Co-authored-by: Rui Vieira <[email protected]> Co-authored-by: Rob Geada <[email protected]> * Weekly sync up of dev/lm-eval branch (#278) * feat: Initial database support (#246) * Initial database support - Add status checking - Add better storage flags - Add spec.storage.format validation - Add DDL -Add HIBERNATE format to DB (test) - Update service image - Revert identifier to DATABASE - Update CR options (remove mandatory data) * Remove default DDL generation env var * Update service image to latest tag * Add migration awareness * Add updating pods for migration * Change JDBC url from mysql to mariadb * Fix TLS mount * Revert images * Remove redundant logic * Fix comments * feat: Add TLS certificate mount on ModelMesh (#255) * feat: Add TLS certificate mount on ModelMesh * Revert from http to https until kserve/modelmesh#147 is merged * Pin oc version, ubi version (#263) * Restore checkout of trustyai-exp (#265) * Add operator installation robustness (#266) * fix: Skip InferenceService patching for KServe RawDeployment (#262) * feat: ConfigMap key to disable KServe Serverless configuration (#267) * feat: Add support for custom certificates in database connection (#259) * Add TLS endpoint for ModelMesh payload processors. (#268) Keep non-TLS endpoint for KServe Serverless (disabled by default) * fix: Correct maxSurge and maxUnavailable (#275) * feat: Add support for custom DB names (#257) * feat: Add support for custom DB names * fix: Correct custom DB name --------- Signed-off-by: Yihong Wang <[email protected]> Co-authored-by: Rui Vieira <[email protected]> Co-authored-by: Rob Geada <[email protected]> * Driver updates job's status periodically (#280) The driver periodically update the LMEvalJob.Status.Message field with the outputs from the lm-eval. The message pattern the driver captures is like `Running text generation: 81%|`. 
Then users can use this information to check the progress of the job. Signed-off-by: Yihong Wang <[email protected]> * Add Dockerfile for LMES job image (#276) Add Dockerfile for LMES job image and the needed files Signed-off-by: Yihong Wang <[email protected]> * feat: Add overlays (#283) * feat: Add overlays * Remove redundant lmes-tas overlay. Change job image name. * Add job image build (#284) * Change job image use midstream lm-evaluation-harness (#285) * feat: support batch size (#290) Add batch size support in the LMEvalJob which leverages the `--batch_size` in the `lm-evaluation-harness`. This only affects the local models. The `--bath_size` doesn't work for remote inference APIs. Signed-off-by: Yihong Wang <[email protected]> * Add the `openai` package into the lmes job image (#292) update the LMES job's Dockerfile to include the `openai` package. Signed-off-by: Yihong Wang <[email protected]> * fix: fix dependency error in the job image (#296) Split up the unitxt and openai dependencies to avoid the conflict. Signed-off-by: Yihong Wang <[email protected]> * feat: add device detection in lmes driver (#298) Added a new feature in LMES driver to detect the available devices by using the PyTorch API. This feature can be disabled by passing the `--detect-device false` option. Signed-off-by: Yihong Wang <[email protected]> * feat: support unitxt recipes (#301) Add new fields in the CRD to support unitxt recipes and leverage the driver to create corresponding yaml files of the unitxt recipes. Signed-off-by: Yihong Wang <[email protected]> * feat: support custom dataset (#309) Updated the CRD data struct to allow users to specify a custom Unitxt card in JSON format. The custom Unitxt card is equivalent to a custom dataset definition. Also restructured and updated the CRD to support Volumes, VolumeMounts, Env, Resources, Labels, and Annotations. 
Signed-off-by: Yihong Wang <[email protected]> * feat: new pulling mechanism for job statuses (#314) Update the driver to keep running even the user program finishes. The driver provides two APIs: - GetStatus(): retrieve job status - Shutdown(): properly tear down the driver In the controller side, it uses `pod/exec` resource to run the driver command to invoke the driver APIs to retrieve the job status and shutdown the driver when job is done. Signed-off-by: Yihong Wang <[email protected]> * Move operator's cmd/operator/main.go to cmd/main.go to keep operator-sdk compatibility (#295) * Remove hardcoded job's user ID (#322) * Fix mkdir command in Job dockerfile (#330) * Refactor some lmesreconcile methods (#323) * Refactor lmes reconcile optoins Signed-off-by: ted chang <[email protected]> * Update controllers/lmes/lmevaljob_controller.go Co-authored-by: Yihong Wang <[email protected]> * Update controllers/lmes/lmevaljob_controller.go Co-authored-by: Yihong Wang <[email protected]> Signed-off-by: ted chang <[email protected]> --------- Signed-off-by: ted chang <[email protected]> Co-authored-by: Yihong Wang <[email protected]> * tidy: clean up lmes-job image (#333) remove BAM related packages and patch. 
Signed-off-by: Yihong Wang <[email protected]> * Enable job suspend for Kueue (#317) * Refactor lmes reconcile optoins Signed-off-by: ted chang <[email protected]> * Update controllers/lmes/lmevaljob_controller.go Co-authored-by: Yihong Wang <[email protected]> * Update controllers/lmes/lmevaljob_controller.go Co-authored-by: Yihong Wang <[email protected]> Signed-off-by: ted chang <[email protected]> * Enable job suspend for Kueue Signed-off-by: ted chang <[email protected]> --------- Signed-off-by: ted chang <[email protected]> Co-authored-by: Yihong Wang <[email protected]> * Add overlay placeholders for main merge (#334) * sync: sync up dev/lm-eval branch with main branch (#336) * [CI] Run tests from trustyai-tests (#279) * Change Dockerfile to clone trustyai-tests * Add PYTEST_MARKERS env and remove TESTS_REGEX * RHOAIENG-12274: Update operator's overlays (#287) * Update operator's overlays * Update kustomization.yaml * Add devflag printout to GH Action comment (#289) * Add timeout loop to DSC install (#305) * RHOAIENG-13625: Add DBAvailable status to CR (#304) * Add DBAvailable status to CR * Remove probes * Add KServe destination rule for Inference Services in the ServiceMesh (#315) * Add DestinationRule creation for KServe serverless * Add permissions for destination rules * Add role for destination rules * Add missing role for creating destination rules * Fix spacing in DestinationRule template * Add check if DestinationRule CRD is present before creating it (#316) * Add check for DestinationRule CRD * Add API extensions to operator's scheme * Add permission for CRD resource * Fix operator metrics service target port (#320) * Add readiness probes (#312) * Enable KServe serverless in the rhoai overlay (#321) * Update overlay images (#331) * Add correct CA cert to JDBC (#324) * Add correct CA cert to JDBC * Add require SSL * Support for VirtualServices for InferenceLogger traffic (#332) * Generate KServe Inference Logger in conformance with DestinationRule 
and VirtualService * Add VirtualService creation for models in the mesh * Add permissions for VirtualServices * Update manifests for VirtualServices * Fix VirtualServiceName variable * fix yaml linter after the sync Signed-off-by: Yihong Wang <[email protected]> * tidy the go.mod and go.sum as well Signed-off-by: Yihong Wang <[email protected]> --------- Signed-off-by: Yihong Wang <[email protected]> Co-authored-by: Adolfo Aguirrezabal <[email protected]> Co-authored-by: Rui Vieira <[email protected]> Co-authored-by: Rob Geada <[email protected]> Co-authored-by: Rui Vieira <[email protected]> --------- Signed-off-by: Yihong Wang <[email protected]> Signed-off-by: ted chang <[email protected]> Co-authored-by: Yihong Wang <[email protected]> Co-authored-by: Rob Geada <[email protected]> Co-authored-by: ted chang <[email protected]> Co-authored-by: Adolfo Aguirrezabal <[email protected]>

openshift-ci bot added do-not-merge/work-in-progress needs-ok-to-test labels Oct 10, 2024

tedhtchang force-pushed the enable-suspend branch 2 times, most recently from a02ce47 to 3245c63 Compare October 11, 2024 23:11

tedhtchang marked this pull request as ready for review October 11, 2024 23:53

openshift-ci bot removed the do-not-merge/work-in-progress label Oct 11, 2024

openshift-ci bot requested a review from yhwang October 12, 2024 00:10

openshift-merge-robot added the needs-rebase label Oct 14, 2024

yhwang reviewed Oct 14, 2024

tedhtchang force-pushed the enable-suspend branch from 3245c63 to 30089eb Compare October 14, 2024 20:36

openshift-merge-robot removed the needs-rebase label Oct 14, 2024

tedhtchang force-pushed the enable-suspend branch 4 times, most recently from f6ca540 to 9d9b959 Compare October 14, 2024 21:27

yhwang approved these changes Oct 14, 2024

openshift-ci bot assigned yhwang Oct 14, 2024

openshift-ci bot added the lgtm label Oct 14, 2024

ruivieira added ok-to-test and removed needs-ok-to-test labels Oct 14, 2024

openshift-ci bot requested a review from ruivieira October 14, 2024 22:20

tedhtchang mentioned this pull request Oct 16, 2024

Add initial Kueue integration #313

Merged

tedhtchang and others added 4 commits October 15, 2024 22:17

Refactor lmes reconcile optoins

1b6df74

Signed-off-by: ted chang <[email protected]>

Update controllers/lmes/lmevaljob_controller.go

6ea5144

Co-authored-by: Yihong Wang <[email protected]>

Update controllers/lmes/lmevaljob_controller.go

3b943ad

Co-authored-by: Yihong Wang <[email protected]> Signed-off-by: ted chang <[email protected]>

Enable job suspend for Kueue

d050f12

Signed-off-by: ted chang <[email protected]>

tedhtchang force-pushed the enable-suspend branch from 9d9b959 to d050f12 Compare October 18, 2024 01:59

openshift-ci bot removed the lgtm label Oct 18, 2024

yhwang added the lm-eval Issues related to LM-Eval label Oct 18, 2024

ruivieira approved these changes Oct 19, 2024

openshift-ci bot assigned ruivieira Oct 19, 2024

openshift-ci bot added the lgtm label Oct 19, 2024

ruivieira merged commit b54e222 into trustyai-explainability:dev/lm-eval Oct 21, 2024
5 checks passed
