Merge LM-Eval dev branch (#337)

* Add lm-eval-service controller (#258) * feat: Initial database support (#246) * Initial database support - Add status checking - Add better storage flags - Add spec.storage.format validation - Add DDL -Add HIBERNATE format to DB (test) - Update service image - Revert identifier to DATABASE - Update CR options (remove mandatory data) * Remove default DDL generation env var * Update service image to latest tag * Add migration awareness * Add updating pods for migration * Change JDBC url from mysql to mariadb * Fix TLS mount * Revert images * Remove redundant logic * Fix comments * feat: Add TLS certificate mount on ModelMesh (#255) * feat: Add TLS certificate mount on ModelMesh * Revert from http to https until kserve/modelmesh#147 is merged * Add lm-eval-service controller refactor the existing TrustyAIService controller and add LMEvalService controller Signed-off-by: Yihong Wang <[email protected]> --------- Signed-off-by: Yihong Wang <[email protected]> Co-authored-by: Rui Vieira <[email protected]> * fix: Fix typo in operator's arguments (#261) Operator's arguments changed from `--eanble-services` to `--enable-services`. trustyai.opendatahub.io_lmevaljobs.yaml and zz_generated.deepcopy.go regenerated. * feat: Add LMES driver build to GHA (#272) * sync: sync dev/lm-eval with main branch (#271) * feat: Initial database support (#246) * Initial database support - Add status checking - Add better storage flags - Add spec.storage.format validation - Add DDL -Add HIBERNATE format to DB (test) - Update service image - Revert identifier to DATABASE - Update CR options (remove mandatory data) * Remove default DDL generation env var * Update service image to latest tag * Add migration awareness * Add updating pods for migration * Change JDBC url from mysql to mariadb * Fix TLS mount * Revert images * Remove redundant logic * Fix comments * feat: Add TLS certificate mount on ModelMesh (#255) * feat: Add TLS certificate mount on ModelMesh * Revert from http to https until kserve/modelmesh#147 is merged * Pin oc version, ubi version (#263) * Restore checkout of trustyai-exp (#265) * Add operator installation robustness (#266) * fix: Skip InferenceService patching for KServe RawDeployment (#262) * feat: ConfigMap key to disable KServe Serverless configuration (#267) * feat: Add support for custom certificates in database connection (#259) * Add TLS endpoint for ModelMesh payload processors. (#268) Keep non-TLS endpoint for KServe Serverless (disabled by default) --------- Signed-off-by: Yihong Wang <[email protected]> Co-authored-by: Rui Vieira <[email protected]> Co-authored-by: Rob Geada <[email protected]> * Weekly sync up of dev/lm-eval branch (#278) * feat: Initial database support (#246) * Initial database support - Add status checking - Add better storage flags - Add spec.storage.format validation - Add DDL -Add HIBERNATE format to DB (test) - Update service image - Revert identifier to DATABASE - Update CR options (remove mandatory data) * Remove default DDL generation env var * Update service image to latest tag * Add migration awareness * Add updating pods for migration * Change JDBC url from mysql to mariadb * Fix TLS mount * Revert images * Remove redundant logic * Fix comments * feat: Add TLS certificate mount on ModelMesh (#255) * feat: Add TLS certificate mount on ModelMesh * Revert from http to https until kserve/modelmesh#147 is merged * Pin oc version, ubi version (#263) * Restore checkout of trustyai-exp (#265) * Add operator installation robustness (#266) * fix: Skip InferenceService patching for KServe RawDeployment (#262) * feat: ConfigMap key to disable KServe Serverless configuration (#267) * feat: Add support for custom certificates in database connection (#259) * Add TLS endpoint for ModelMesh payload processors. (#268) Keep non-TLS endpoint for KServe Serverless (disabled by default) * fix: Correct maxSurge and maxUnavailable (#275) * feat: Add support for custom DB names (#257) * feat: Add support for custom DB names * fix: Correct custom DB name --------- Signed-off-by: Yihong Wang <[email protected]> Co-authored-by: Rui Vieira <[email protected]> Co-authored-by: Rob Geada <[email protected]> * Driver updates job's status periodically (#280) The driver periodically update the LMEvalJob.Status.Message field with the outputs from the lm-eval. The message pattern the driver captures is like `Running text generation: 81%|`. Then users can use this information to check the progress of the job. Signed-off-by: Yihong Wang <[email protected]> * Add Dockerfile for LMES job image (#276) Add Dockerfile for LMES job image and the needed files Signed-off-by: Yihong Wang <[email protected]> * feat: Add overlays (#283) * feat: Add overlays * Remove redundant lmes-tas overlay. Change job image name. * Add job image build (#284) * Change job image use midstream lm-evaluation-harness (#285) * feat: support batch size (#290) Add batch size support in the LMEvalJob which leverages the `--batch_size` in the `lm-evaluation-harness`. This only affects the local models. The `--bath_size` doesn't work for remote inference APIs. Signed-off-by: Yihong Wang <[email protected]> * Add the `openai` package into the lmes job image (#292) update the LMES job's Dockerfile to include the `openai` package. Signed-off-by: Yihong Wang <[email protected]> * fix: fix dependency error in the job image (#296) Split up the unitxt and openai dependencies to avoid the conflict. Signed-off-by: Yihong Wang <[email protected]> * feat: add device detection in lmes driver (#298) Added a new feature in LMES driver to detect the available devices by using the PyTorch API. This feature can be disabled by passing the `--detect-device false` option. Signed-off-by: Yihong Wang <[email protected]> * feat: support unitxt recipes (#301) Add new fields in the CRD to support unitxt recipes and leverage the driver to create corresponding yaml files of the unitxt recipes. Signed-off-by: Yihong Wang <[email protected]> * feat: support custom dataset (#309) Updated the CRD data struct to allow users to specify a custom Unitxt card in JSON format. The custom Unitxt card is equivalent to a custom dataset definition. Also restructured and updated the CRD to support Volumes, VolumeMounts, Env, Resources, Labels, and Annotations. Signed-off-by: Yihong Wang <[email protected]> * feat: new pulling mechanism for job statuses (#314) Update the driver to keep running even the user program finishes. The driver provides two APIs: - GetStatus(): retrieve job status - Shutdown(): properly tear down the driver In the controller side, it uses `pod/exec` resource to run the driver command to invoke the driver APIs to retrieve the job status and shutdown the driver when job is done. Signed-off-by: Yihong Wang <[email protected]> * Move operator's cmd/operator/main.go to cmd/main.go to keep operator-sdk compatibility (#295) * Remove hardcoded job's user ID (#322) * Fix mkdir command in Job dockerfile (#330) * Refactor some lmesreconcile methods (#323) * Refactor lmes reconcile optoins Signed-off-by: ted chang <[email protected]> * Update controllers/lmes/lmevaljob_controller.go Co-authored-by: Yihong Wang <[email protected]> * Update controllers/lmes/lmevaljob_controller.go Co-authored-by: Yihong Wang <[email protected]> Signed-off-by: ted chang <[email protected]> --------- Signed-off-by: ted chang <[email protected]> Co-authored-by: Yihong Wang <[email protected]> * tidy: clean up lmes-job image (#333) remove BAM related packages and patch. Signed-off-by: Yihong Wang <[email protected]> * Enable job suspend for Kueue (#317) * Refactor lmes reconcile optoins Signed-off-by: ted chang <[email protected]> * Update controllers/lmes/lmevaljob_controller.go Co-authored-by: Yihong Wang <[email protected]> * Update controllers/lmes/lmevaljob_controller.go Co-authored-by: Yihong Wang <[email protected]> Signed-off-by: ted chang <[email protected]> * Enable job suspend for Kueue Signed-off-by: ted chang <[email protected]> --------- Signed-off-by: ted chang <[email protected]> Co-authored-by: Yihong Wang <[email protected]> * Add overlay placeholders for main merge (#334) * sync: sync up dev/lm-eval branch with main branch (#336) * [CI] Run tests from trustyai-tests (#279) * Change Dockerfile to clone trustyai-tests * Add PYTEST_MARKERS env and remove TESTS_REGEX * RHOAIENG-12274: Update operator's overlays (#287) * Update operator's overlays * Update kustomization.yaml * Add devflag printout to GH Action comment (#289) * Add timeout loop to DSC install (#305) * RHOAIENG-13625: Add DBAvailable status to CR (#304) * Add DBAvailable status to CR * Remove probes * Add KServe destination rule for Inference Services in the ServiceMesh (#315) * Add DestinationRule creation for KServe serverless * Add permissions for destination rules * Add role for destination rules * Add missing role for creating destination rules * Fix spacing in DestinationRule template * Add check if DestinationRule CRD is present before creating it (#316) * Add check for DestinationRule CRD * Add API extensions to operator's scheme * Add permission for CRD resource * Fix operator metrics service target port (#320) * Add readiness probes (#312) * Enable KServe serverless in the rhoai overlay (#321) * Update overlay images (#331) * Add correct CA cert to JDBC (#324) * Add correct CA cert to JDBC * Add require SSL * Support for VirtualServices for InferenceLogger traffic (#332) * Generate KServe Inference Logger in conformance with DestinationRule and VirtualService * Add VirtualService creation for models in the mesh * Add permissions for VirtualServices * Update manifests for VirtualServices * Fix VirtualServiceName variable * fix yaml linter after the sync Signed-off-by: Yihong Wang <[email protected]> * tidy the go.mod and go.sum as well Signed-off-by: Yihong Wang <[email protected]> --------- Signed-off-by: Yihong Wang <[email protected]> Co-authored-by: Adolfo Aguirrezabal <[email protected]> Co-authored-by: Rui Vieira <[email protected]> Co-authored-by: Rob Geada <[email protected]> Co-authored-by: Rui Vieira <[email protected]> --------- Signed-off-by: Yihong Wang <[email protected]> Signed-off-by: ted chang <[email protected]> Co-authored-by: Yihong Wang <[email protected]> Co-authored-by: Rob Geada <[email protected]> Co-authored-by: ted chang <[email protected]> Co-authored-by: Adolfo Aguirrezabal <[email protected]>
trustyai-explainability · Oct 22, 2024 · f70af36 · f70af36
1 parent ddd093a
commit f70af36
Show file tree

Hide file tree

Showing 83 changed files with 7,730 additions and 340 deletions.
diff --git a/.github/workflows/build-and-push.yaml b/.github/workflows/build-and-push.yaml
@@ -56,6 +56,8 @@ jobs:
           echo "GITHUB.HEAD_REF: ${{ github.head_ref }}"
           echo "SHA: ${{ github.event.pull_request.head.sha }}"
           echo "MAIN IMAGE AT: ${{ vars.QUAY_RELEASE_REPO }}:latest"
+          echo "LMES DRIVER IMAGE AT: ${{ vars.QUAY_RELEASE_LMES_DRIVER_REPO }}:latest"
+          echo "LMES JOB IMAGE AT: ${{ vars.QUAY_RELEASE_LMES_JOB_REPO }}:latest"
           echo "CI IMAGE AT: quay.io/trustyai/trustyai-service-operator-ci:${{ github.event.pull_request.head.sha }}"
       #
       # Set environments depending on context
@@ -64,27 +66,41 @@ jobs:
         run: |
           echo "TAG=${{ github.event.pull_request.head.sha }}" >> $GITHUB_ENV
           echo "IMAGE_NAME=quay.io/trustyai/trustyai-service-operator-ci" >> $GITHUB_ENV
+          echo "DRIVER_IMAGE_NAME=quay.io/trustyai/ta-lmes-driver-ci" >> $GITHUB_ENV
+          echo "JOB_IMAGE_NAME=quay.io/trustyai/ta-lmes-job-ci" >> $GITHUB_ENV
       - name: Set main-branch environment
         if:  env.BUILD_CONTEXT == 'main'
         run: |
           echo "TAG=latest" >> $GITHUB_ENV
           echo "IMAGE_NAME=${{ vars.QUAY_RELEASE_REPO }}" >> $GITHUB_ENV
+          echo "DRIVER_IMAGE_NAME=${{ vars.QUAY_RELEASE_LMES_DRIVER_REPO }}" >> $GITHUB_ENV
+          echo "JOB_IMAGE_NAME=${{ vars.QUAY_RELEASE_LMES_JOB_REPO }}" >> $GITHUB_ENV
       - name: Set tag environment
         if: env.BUILD_CONTEXT == 'tag'
         run: |
           echo "TAG=${{ github.ref_name }}" >> $GITHUB_ENV
           echo "IMAGE_NAME=${{ vars.QUAY_RELEASE_REPO }}" >> $GITHUB_ENV
+          echo "DRIVER_IMAGE_NAME=${{ vars.QUAY_RELEASE_LMES_DRIVER_REPO }}" >> $GITHUB_ENV
+          echo "JOB_IMAGE_NAME=${{ vars.QUAY_RELEASE_LMES_JOB_REPO }}" >> $GITHUB_ENV
 
-       # Run docker commands
+      # Run docker commands
       - name: Put expiry date on CI-tagged image
         if:  env.BUILD_CONTEXT == 'ci'
         run: sed -i 's#summary="odh-trustyai-service-operator\"#summary="odh-trustyai-service-operator" \\ \n      quay.expires-after=7d#' Dockerfile
       - name: Log in to Quay
         run: docker login -u ${{ secrets.QUAY_ROBOT_USERNAME }} -p ${{ secrets.QUAY_ROBOT_SECRET }} quay.io
-      - name: Build image
+      - name: Build main image
         run: docker build -t ${{ env.IMAGE_NAME }}:$TAG .
-      - name: Push to Quay CI repo
+      - name: Push main image to Quay
         run: docker push ${{ env.IMAGE_NAME }}:$TAG
+      - name: Build LMES driver image
+        run: docker build -f Dockerfile.driver -t ${{ env.DRIVER_IMAGE_NAME }}:$TAG .
+      - name: Push LMES driver image to Quay
+        run: docker push ${{ env.DRIVER_IMAGE_NAME }}:$TAG
+      - name: Build LMES job image
+        run: docker build -f Dockerfile.lmes-job -t ${{ env.JOB_IMAGE_NAME }}:$TAG .
+      - name: Push LMES job image to Quay
+        run: docker push ${{ env.JOB_IMAGE_NAME }}:$TAG
 
       # Create CI Manifests
       - name: Set up manifests for CI
@@ -127,6 +143,10 @@ jobs:
 
             📦 [PR image](https://quay.io/trustyai/trustyai-service-operator-ci:${{ github.event.pull_request.head.sha }}): `quay.io/trustyai/trustyai-service-operator-ci:${{ github.event.pull_request.head.sha }}`
 
+            📦 [LMES driver image](https://quay.io/trustyai/ta-lmes-driver:${{ github.event.pull_request.head.sha }}): `quay.io/trustyai/ta-lmes-driver:${{ github.event.pull_request.head.sha }}`
+
+            📦 [LMES job image](https://quay.io/trustyai/ta-lmes-job:${{ github.event.pull_request.head.sha }}): `quay.io/trustyai/ta-lmes-job:${{ github.event.pull_request.head.sha }}`
+
             🗂️ [CI manifests](https://github.com/trustyai-explainability/trustyai-service-operator-ci/tree/operator-${{ env.TAG }})
             
             ```

diff --git a/.github/workflows/controller-tests.yaml b/.github/workflows/controller-tests.yaml
@@ -13,7 +13,7 @@ jobs:
       - name: Setup Go
         uses: actions/setup-go@v4
         with:
-          go-version: '1.19.0'
+          go-version: '1.21.12'
 
       - name: Download & install envtest binaries
         run: |

diff --git a/.yamllint.yaml b/.yamllint.yaml
@@ -6,4 +6,6 @@ rules:
     level: warning
   hyphens:
     max-spaces-after: 1
-    level: warning
+    level: warning
+  indentation:
+    indent-sequences: consistent
diff --git a/Dockerfile b/Dockerfile
@@ -1,5 +1,5 @@
 # Build the manager binary
-FROM registry.access.redhat.com/ubi8/go-toolset:1.21 as builder
+FROM registry.access.redhat.com/ubi8/go-toolset:1.21 AS builder
 ARG TARGETOS
 ARG TARGETARCH
 
@@ -12,7 +12,7 @@ COPY go.sum go.sum
 RUN go mod download
 
 # Copy the go source
-COPY main.go main.go
+COPY cmd/ cmd/
 COPY api/ api/
 COPY controllers/ controllers/
 
@@ -22,7 +22,7 @@ COPY controllers/ controllers/
 # the docker BUILDPLATFORM arg will be linux/arm64 when for Apple x86 it will be linux/amd64. Therefore,
 # by leaving it empty we can ensure that the container and binary shipped on it will have the same platform.
 USER root
-RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} go build -a -o manager main.go
+RUN CGO_ENABLED=0 GOOS=${TARGETOS:-linux} GOARCH=${TARGETARCH} go build -a -o manager cmd/main.go
 
 # Use distroless as minimal base image to package the manager binary
 # Refer to https://github.com/GoogleContainerTools/distroless for more details

diff --git a/Dockerfile.driver b/Dockerfile.driver
@@ -0,0 +1,25 @@
+FROM registry.access.redhat.com/ubi8/go-toolset:1.21 AS builder
+
+WORKDIR /go/src/github.com/trustyai-explainability/trustyai-service-operator
+# Copy the Go Modules manifests
+COPY go.mod go.mod
+COPY go.sum go.sum
+# cache deps before building and copying source so that we don't need to re-download as much
+# and so that source changes don't invalidate our downloaded layer
+RUN go mod download
+# Copy the go source
+COPY cmd/ cmd/
+COPY api/ api/
+COPY controllers/ controllers/
+
+RUN GO111MODULE=on CGO_ENABLED=0 GOOS=linux go build -tags netgo -ldflags '-extldflags "-static"' -o /bin/driver ./cmd/lmes_driver/*.go
+
+FROM registry.access.redhat.com/ubi8/ubi-minimal:latest
+
+COPY --from=builder /bin/driver /bin/driver
+
+USER 65532:65532
+
+WORKDIR /bin
+
+ENTRYPOINT [ "/bin/driver" ]
diff --git a/Dockerfile.lmes-job b/Dockerfile.lmes-job
@@ -0,0 +1,27 @@
+FROM registry.access.redhat.com/ubi9/python-311@sha256:fccda5088dd13d2a3f2659e4c904beb42fc164a0c909e765f01af31c58affae3
+
+USER root
+RUN sed -i.bak 's/include-system-site-packages = false/include-system-site-packages = true/' /opt/app-root/pyvenv.cfg
+
+USER default
+WORKDIR /opt/app-root/src
+RUN mkdir /opt/app-root/src/hf_home && chmod g+rwx /opt/app-root/src/hf_home
+RUN mkdir /opt/app-root/src/output && chmod g+rwx /opt/app-root/src/output
+RUN mkdir /opt/app-root/src/my_tasks && chmod g+rwx /opt/app-root/src/my_tasks
+RUN mkdir -p /opt/app-root/src/my_catalogs/cards && chmod -R g+rwx /opt/app-root/src/my_catalogs
+RUN mkdir -p /opt/app-root/src/.cache
+ENV PATH="/opt/app-root/bin:/opt/app-root/src/.local/bin/:/opt/app-root/src/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
+
+# Clone the Git repository, check out v0.4.4 and install the Python package
+RUN git clone https://github.com/opendatahub-io/lm-evaluation-harness.git && \
+    cd lm-evaluation-harness &&  git checkout 543617fef9ba885e87f8db8930fbbff1d4e2ca49 && \
+    pip install --no-cache-dir --user -e .[api]
+
+RUN python -c 'from lm_eval.tasks.unitxt import task; import os.path; print("class: !function " + task.__file__.replace("task.py", "task.Unitxt"))' > ./my_tasks/unitxt
+
+ENV PYTHONPATH=/opt/app-root/src/.local/lib/python3.11/site-packages:/opt/app-root/src/lm-evaluation-harness:/opt/app-root/src:/opt/app-root/src/server
+ENV HF_HOME=/opt/app-root/src/hf_home
+ENV UNITXT_ARTIFACTORIES=/opt/app-root/src/my_catalogs
+
+CMD ["/opt/app-root/bin/python"]
+
diff --git a/Makefile b/Makefile
@@ -7,6 +7,9 @@ VERSION ?= 1.17.0
 
 BUILD_TOOL ?= podman
 
+# enable TrustyAIService by default for `make run`
+ENABLED_SERVICES ?= TAS
+
 # CHANNELS define the bundle channels used in the bundle.
 # Add a new line here if you would like to change its default config. (E.g CHANNELS = "candidate,fast,stable")
 # To re-generate a bundle for other specific channels without changing the standard setup, you can:
@@ -111,11 +114,11 @@ test: manifests generate fmt vet envtest ## Run tests.
 
 .PHONY: build
 build: manifests generate fmt vet ## Build manager binary.
-	go build -o bin/manager main.go
+	go build -o bin/manager cmd/main.go
 
 .PHONY: run
 run: manifests generate fmt vet ## Run a controller from your host.
-	go run ./main.go
+	go run ./cmd/main.go --enable-services $(ENABLED_SERVICES)
 
 # If you wish built the manager image targeting other platforms you can use the --platform flag.
 # (i.e. docker build --platform linux/arm64 ). However, you must enable docker buildKit for it.
@@ -182,7 +185,7 @@ ENVTEST ?= $(LOCALBIN)/setup-envtest
 
 ## Tool Versions
 KUSTOMIZE_VERSION ?= v3.8.7
-CONTROLLER_TOOLS_VERSION ?= v0.11.1
+CONTROLLER_TOOLS_VERSION ?= v0.16.3
 
 KUSTOMIZE_INSTALL_SCRIPT ?= "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"
 .PHONY: kustomize

diff --git a/PROJECT b/PROJECT
@@ -4,7 +4,7 @@
 # More info: https://book.kubebuilder.io/reference/project-config.html
 domain: opendatahub.io
 layout:
-- go.kubebuilder.io/v3
+- go.kubebuilder.io/v4
 plugins:
   manifests.sdk.operatorframework.io/v2: {}
   scorecard.sdk.operatorframework.io/v2: {}
@@ -18,6 +18,19 @@ resources:
   domain: opendatahub.io
   group: trustyai
   kind: TrustyAIService
-  path: github.com/trustyai-explainability/trustyai-service-operator/api/v1alpha1
+  path: github.com/trustyai-explainability/trustyai-service-operator/api/tas/v1alpha1
   version: v1alpha1
+- api:
+    crdVersion: v1
+    namespaced: true
+  controller: true
+  domain: opendatahub.io
+  group: trustyai
+  kind: LMEvalJob
+  path: github.com/trustyai-explainability/trustyai-service-operator/api/lmes/v1alpha1
+  version: v1alpha1
+  webhooks:
+    defaulting: true
+    validation: true
+    webhookVersion: v1
 version: "3"
diff --git a/api/lmes/v1alpha1/groupversion_info.go b/api/lmes/v1alpha1/groupversion_info.go
@@ -0,0 +1,43 @@
+/*
+Copyright 2024.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+// Package v1alpha1 contains API Schema definitions for the trustyai.opendatahub.io v1alpha1 API group
+// +kubebuilder:object:generate=true
+// +groupName=trustyai.opendatahub.io
+package v1alpha1
+
+import (
+	"k8s.io/apimachinery/pkg/runtime/schema"
+	"sigs.k8s.io/controller-runtime/pkg/scheme"
+)
+
+const (
+	GroupName     = "trustyai.opendatahub.io"
+	Version       = "v1alpha1"
+	KindName      = "LMEvalJob"
+	FinalizerName = "trustyai.opendatahub.io/lmes-finalizer"
+)
+
+var (
+	// GroupVersion is group version used to register these objects
+	GroupVersion = schema.GroupVersion{Group: GroupName, Version: Version}
+
+	// SchemeBuilder is used to add go types to the GroupVersionKind scheme
+	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}
+
+	// AddToScheme adds the types in this group-version to the given scheme.
+	AddToScheme = SchemeBuilder.AddToScheme
+)