Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add progress deadline support for SDeps #4235

Merged

Conversation

agrski
Copy link
Contributor

@agrski agrski commented Jul 28, 2022

What this PR does / why we need it:
Currently, Seldon Deployments do not support setting progressDeadlineSeconds, which is an optional setting used to control when a deployment is marked as failed.

This field is defined for deployments, not pods or containers, thus users cannot set this manually by overriding component specs in an SDep.

As progressDeadlineSeconds is optional in Kubernetes, with a default of 600 seconds, this change is backwards compatible -- any SDep not specifying this will use the Kubernetes default automatically.

Which issue(s) this PR fixes:

Fixes #4224

Special notes for your reviewer:

@agrski
Copy link
Contributor Author

agrski commented Jul 29, 2022

Testing these changes was a bit of a convoluted process but did confirm the intended behaviour.

For reference, I did the following:

  • Created a fresh kind cluster called ansible (using ansible-k8s-collection)
  • Generated new CRDs and a corresponding controller/operator image
$ make -C operator/ manifests
$ make -C operator/ kind-image-install
  • Ensured the Seldon Core Helm charts were up to date locally
$ make -C operator/helm/ create
  • Released the local Helm chart
$ helm upgrade --install seldon-core ./helm-charts/seldon-core-operator/ -n seldon-system
  • Modified the mean_classifier example to delay for a long time (1000 seconds) in its __init__() call
$ git diff -U1 examples/models/mean_classifier/MeanClassifier.py
diff --git a/examples/models/mean_classifier/MeanClassifier.py b/examples/models/mean_classifier/MeanClassifier.py
index f3bbf2421..f899b0d39 100644
--- a/examples/models/mean_classifier/MeanClassifier.py
+++ b/examples/models/mean_classifier/MeanClassifier.py
@@ -10,3 +10,3 @@ class MeanClassifier(object):
 
-    def __init__(self, intValue=0, delaySecs=0):
+    def __init__(self, intValue=0, delaySecs=1000):
         self.class_names = ["proba"]
@@ -21,2 +21,4 @@ class MeanClassifier(object):
         logging.info("loading model here")
+        time.sleep(self.delay_secs)
+        logging.info("loaded model after delay")
  • Built and loaded this modified model into my kind cluster
$ git diff -U0 examples/models/mean_classifier/Makefile 
diff --git a/examples/models/mean_classifier/Makefile b/examples/models/mean_classifier/Makefile
index 5d620a154..173c36b75 100644
--- a/examples/models/mean_classifier/Makefile
+++ b/examples/models/mean_classifier/Makefile
@@ -17 +17 @@ kind_load: build
-       kind load -v 3 docker-image ${IMAGE_BASE}:${VERSION}
+       kind load -v 3 docker-image ${IMAGE_BASE}:${VERSION} --name=ansible

$ make -C examples/models/mean_classifier/ kind_load
s2i build -E environment . seldonio/seldon-core-s2i-python38-ubi8:1.15.0-dev seldonio/mock_classifier:1.15.0-dev
...
kind load -v 3 docker-image seldonio/mock_classifier:1.15.0-dev --name=ansible
Image: "seldonio/mock_classifier:1.15.0-dev" with ID "sha256:7dbdc1af0116fa15612f5027c677440b024c7cd88ccb09d7ab4ce0d90f348569" not yet present on node "ansible-control-plane", loading...
  • Created an SDep with a 60-second progress deadline
$ git diff -U1 notebooks/resources/model.yaml 
diff --git a/notebooks/resources/model.yaml b/notebooks/resources/model.yaml
index a1e066e8e..d65aae10c 100644
--- a/notebooks/resources/model.yaml
+++ b/notebooks/resources/model.yaml
@@ -10,3 +10,3 @@ spec:
         containers:
-        - image: seldonio/mock_classifier:1.9.1
+        - image: seldonio/mock_classifier:1.15.0-dev
           name: classifier
@@ -19 +19,2 @@ spec:
     replicas: 1
+    progressDeadlineSeconds: 60

$ kubectl apply -f notebooks/resources/model.yaml 
  • Waited a minute and confirmed the SDep and its underlying deployment were marked as failed
$ k -n seldon get deployments.apps seldon-model-example-0-classifier -ojsonpath={..status} | jq
{
  "conditions": [
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "lastUpdateTime": "2022-07-28T17:07:39Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2022-07-28T17:08:40Z",
      "lastUpdateTime": "2022-07-28T17:08:40Z",
      "message": "ReplicaSet \"seldon-model-example-0-classifier-58bfd5d7b\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 1,
  "replicas": 1,
  "unavailableReplicas": 1,
  "updatedReplicas": 1
}

$ k -n seldon get sdep seldon-model -o jsonpath={..status} | jq
{
  "address": {
    "url": "http://seldon-model-example.seldon.svc.cluster.local:8000/api/v1.0/predictions"
  },
  "conditions": [
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "DeploymentsReady"
    },
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "reason": "No HPAs defined",
      "status": "True",
      "type": "HpasReady"
    },
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "reason": "No KEDA resources defined",
      "status": "True",
      "type": "KedaReady"
    },
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "reason": "No PDBs defined",
      "status": "True",
      "type": "PdbsReady"
    },
    {
      "lastTransitionTime": "2022-07-28T17:08:40Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2022-07-28T17:07:39Z",
      "reason": "Not all services created",
      "status": "False",
      "type": "ServicesReady"
    }
  ],
  "deploymentStatus": {
    "seldon-model-example-0-classifier": {
      "replicas": 1
    }
  },
  "description": "Deployment is no longer progressing and not available.",
  "replicas": 1,
  "state": "Failed"
}

@axsaucedo
Copy link
Contributor

Nice one @agrski ! Will test from branch and review. Seems executor test failed but we need to validate if its just the flaky one.

/text integration

@agrski
Copy link
Contributor Author

agrski commented Jul 29, 2022

Nice one @agrski ! Will test from branch and review. Seems executor test failed but we need to validate if its just the flaky one.

/text integration

From checking GHA, looks like both failures are flaky. Note this is still in draft as I also need to check in some changes to manifests -- just trying to fix my controller-gen setup as it's giving me a weird version and some changes I didn't expect.

@seldondev seldondev added size/L and removed size/S labels Jul 29, 2022
@agrski agrski requested a review from axsaucedo July 29, 2022 10:34
@@ -43,38 +29,6 @@ webhooks:
path: /validate-machinelearning-seldon-io-v1-seldondeployment
failurePolicy: Fail
name: v1.vseldondeployment.kb.io
{{- if semverCompare ">=1.15.0" .Capabilities.KubeVersion.GitVersion }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ I'm not sure how/why this no longer exists -- was I running the wrong Make commands or missing a step?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the version of your controller gen? That often tends to cause variations in the generated tokens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's 0.7.0 -- I had to uninstall v0.4 and use the make target to install v0.7, as it created a number of small differences. The generation of this file was purely with v0.7.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that does sound like should be ok - could you try again using make -C operator/helm create? That should use the docker image to create it so it should be the exact same output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I did before - it's documented in this comment, near the top. I did notice it's using Docker to run some of the other commands in the Makefile

@agrski agrski marked this pull request as ready for review July 29, 2022 11:07
@axsaucedo
Copy link
Contributor

/test integration

1 similar comment
@axsaucedo
Copy link
Contributor

/test integration

@axsaucedo
Copy link
Contributor

Not sure why the logs / updates from the CI tests are not appearing, but looking at the logs it seems there were a few failed tests, although it does seem like there may be an issue in the CI so I will have a look

image

@axsaucedo
Copy link
Contributor

/test integration

@agrski agrski force-pushed the add-progress-deadline-support-for-sdeps branch from 8551f53 to 699b9a0 Compare August 5, 2022 16:42
@agrski
Copy link
Contributor Author

agrski commented Aug 5, 2022

Force pushed after rebasing onto master

@axsaucedo
Copy link
Contributor

/test integration

@agrski
Copy link
Contributor Author

agrski commented Aug 8, 2022

Doesn't look like integration tests want to run -- I'm not seeing anything for this PR in JX

Edit: the UI was lying. I can see the logs for the pod corresponding to this PR are progressing, but they're very slow and full of failed connections to endpoints, so I'm guessing things are a bit slow to spin up in the tests.

@axsaucedo
Copy link
Contributor

@agrski there's actually an issue with the UI, more details here #4262

@axsaucedo
Copy link
Contributor

Running integration for retest of CI
/test integration

@axsaucedo
Copy link
Contributor

I've stopped the pipeline manually with jx stop pipeline to explore the webhook issue separate to this PR and look towards unblocking

@agrski agrski force-pushed the add-progress-deadline-support-for-sdeps branch from 699b9a0 to 97f4402 Compare August 12, 2022 09:34
@axsaucedo
Copy link
Contributor

/test integration

@axsaucedo
Copy link
Contributor

/test notebooks

@axsaucedo
Copy link
Contributor

Failed test is TestNotebooks.test_protocol_examples which seems flaky, rerunning to validate
/test notebooks

@axsaucedo
Copy link
Contributor

image

/approve

@axsaucedo axsaucedo merged commit d8ded76 into SeldonIO:master Sep 8, 2022
@seldondev
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: axsaucedo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Status value Long duration Creating
3 participants