Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

combine virtualservices into one #3609

Merged
merged 8 commits into from
Oct 25, 2021
Merged

Conversation

mwm5945
Copy link
Contributor

@mwm5945 mwm5945 commented Sep 21, 2021

What this PR does / why we need it:
Combines the istio virtual services into one to allow for better use of mesh networking.

Which issue(s) this PR fixes:

Fixes #3485 #1472

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

combine istio virtual services into one

@seldondev
Copy link
Collaborator

Hi @mwm5945. Thanks for your PR.

I'm waiting for a SeldonIO or todo member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository.

@mwm5945 mwm5945 marked this pull request as ready for review September 22, 2021 13:14
@mwm5945 mwm5945 changed the title combine virtualservices into one [do not merge] combine virtualservices into one Sep 22, 2021
@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 22, 2021

/assign @majolo @axsaucedo

@ukclivecox
Copy link
Contributor

/test integration

@ukclivecox
Copy link
Contributor

/test notebooks

@ukclivecox
Copy link
Contributor

Hi @mwm5945 Have you tested with upgrades. I would hope

//Cleanup unused VirtualService. This should usually only happen on Operator upgrades where there is a breaking change to the names of the VirtualServices created
//Only run if we have virtualservices to create - implies we are running with istio active
if len(components.virtualServices) > 0 && ready {
cleaner := ResourceCleaner{instance: instance, client: r.Client, virtualServices: components.virtualServices, logger: r.Log}
deleted, err := cleaner.cleanUnusedVirtualServices()
if err != nil {

works as expected.

@ukclivecox
Copy link
Contributor

/test notebooks

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 29, 2021

@cliveseldon let me validate and get back to you, i did test with the canary example and it worked as expected, but i'd like to do some more validation!

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 30, 2021

@cliveseldon i've confirmed that the behavior is the same as the old operator:

Canary invoked via a gateway:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: canary-example-1
  namespace: my-ns
spec:
  name: canary-example-1
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.11.0
          imagePullPolicy: IfNotPresent
          name: classifier
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: GRPC
      name: classifier
      type: MODEL
    labels:
      sidecar.istio.io/inject: "true"
    name: main
    replicas: 1
    traffic: 75
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.11.0
          imagePullPolicy: IfNotPresent
          name: classifier
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: GRPC
      name: classifier
      type: MODEL
    labels:
      sidecar.istio.io/inject: "true"
    name: canary
    replicas: 1
    traffic: 25

Canary with Mesh (invoked from inside the Istio Mesh)

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  labels:
    app: seldon
  name: canary-example-1
  namespace: my-ns
spec:
  annotations:
    seldon.io/istio-gateway: mesh
    seldon.io/istio-host: canary-example-1
  name: canary-example-1
  predictors:
  - annotations:
      seldon.io/svc-name: canary-example-1
    componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.11.0
          imagePullPolicy: IfNotPresent
          name: classifier
          securityContext:
            readOnlyRootFilesystem: false
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: GRPC
      name: classifier
      type: MODEL
    labels:
      sidecar.istio.io/inject: "true"
    name: main
    replicas: 1
    traffic: 75
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.11.0
          imagePullPolicy: IfNotPresent
          name: classifier
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: GRPC
      name: classifier
      type: MODEL
    labels:
      sidecar.istio.io/inject: "true"
    name: canary
    replicas: 1
    traffic: 25

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 30, 2021

Creating/Updating/Deleting SDEPs also cleans all resources up

@ukclivecox
Copy link
Contributor

That's great @mwm5945 I was thinking more of the upgrade process for someone running 1.11 and then switching and upgrading to this. I see the upgrade test did fail in notebooks test.

@ukclivecox
Copy link
Contributor

/test integration

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 30, 2021

i realized that after posting haha--i did "upgrade" from the public v1.11.0 image to a v1.11.0 image that included my changes, and the old VSs were deleted, and the new combined ones were created accordingly

it doesn't look like i have access to view whats failing in the tests though

@ukclivecox
Copy link
Contributor

@mwm5945 great

@mwm5945
Copy link
Contributor Author

mwm5945 commented Sep 30, 2021

@cliveseldon in a weird coincidence to the code you were referencing before, we discovered an instance where the seldon operator would delete a virtual service (that it didn't create) if the ownerReference name was the same. I've included an update to only delete the VS if the UID matches that of the SDEP.

would you be able to let me know what tests are failing and where? I can't seem to get make test working locally :/

@ukclivecox
Copy link
Contributor

/test notebooks

@ukclivecox
Copy link
Contributor

/test integration

@ukclivecox
Copy link
Contributor

@cliveseldon in a weird coincidence to the code you were referencing before, we discovered an instance where the seldon operator would delete a virtual service (that it didn't create) if the ownerReference name was the same. I've included an update to only delete the VS if the UID matches that of the SDEP.

Interesting. So in what circumstances would the name be the same but it did not create the virtual service and its not one of the virtual services listed for that SeldonDeployment?

would you be able to let me know what tests are failing and where? I can't seem to get make test working locally :/

2 of the tests in integration are known to be flaky so I think we can ignore those. For the notebook tests its also the flaky upgrade test which do cover this case but I think is probably ok to ignore. Am running the tests again.

@mwm5945
Copy link
Contributor Author

mwm5945 commented Oct 1, 2021

Interesting. So in what circumstances would the name be the same but it did not create the virtual service and its not one of the virtual services listed for that SeldonDeployment?

Without getting too far into it, we have some other CRs that do lots of other things in clusters, and are usually named the same as the SDEP. One solution is to rename our custom CRs, but adding this logic to the operator seems safer in the long term :)

2 of the tests in integration are known to be flaky so I think we can ignore those. For the notebook tests its also the flaky upgrade test which do cover this case but I think is probably ok to ignore. Am running the tests again.

Thanks for letting me know!

@ukclivecox
Copy link
Contributor

/test notebooks

@ukclivecox
Copy link
Contributor

/test integration

@seldondev seldondev added size/L and removed size/M labels Oct 1, 2021
@mwm5945
Copy link
Contributor Author

mwm5945 commented Oct 1, 2021

@axsaucedo added some documentation on using the mesh, not sure whats making the docs build break though...

@axsaucedo
Copy link
Contributor

@mwm5945 for docs test we've added a PR that fixed it so if you rebase should be passing again

@axsaucedo
Copy link
Contributor

/test integration

@axsaucedo
Copy link
Contributor

/test integration

@axsaucedo
Copy link
Contributor

/test notebooks

@axsaucedo
Copy link
Contributor

Nice one @mwm5945 - looks good

My current question is that most e2e tests are currently run with Ambassador, would you be able to try running a previous version of Seldon Core (ie 1.9.1) wth a couple of modela and upgrade to just validate that the conversation of virtualservices is done correctly?

I will also try a test similar to that to validate and send requests, just to make sure that there is no upgrade hiccups to avoid downtime.

Other than that looks good from my side

@seldondev
Copy link
Collaborator

@mwm5945: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
notebooks 1bf00a9 link /test notebooks

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the jenkins-x/lighthouse repository. I understand the commands that are listed here.

@axsaucedo
Copy link
Contributor

It seems all notebook tests pass - the only failure is because the 1.12.0-dev charts have been pushed for some reason and this test fails if the dev charts are already bee updated, so seems good on the tests front.

If you can validate the comment above then we should be good to merge.

image

@mwm5945
Copy link
Contributor Author

mwm5945 commented Oct 22, 2021

@axsaucedo i've confirmed that upgrading will cause no downtime, as well as downgrading :)

@mwm5945
Copy link
Contributor Author

mwm5945 commented Oct 22, 2021

While running a perf test, i upgraded from 1.10, watched the new VS be created, and then the old ones deleted w/o errors. Same goes for the reverse.

@axsaucedo
Copy link
Contributor

Nice one @mwm5945 - tested locally as well and all seems good
/approve

@seldondev
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: axsaucedo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@axsaucedo
Copy link
Contributor

Only thing is we should make a note in the UPGRADING.md page for 1.12, but we can do this in a separate PR (would just be a heads up for people that upgrade that the virtual service would be modified)

@axsaucedo axsaucedo merged commit f637bdf into SeldonIO:master Oct 25, 2021
stephen37 pushed a commit to stephen37/seldon-core that referenced this pull request Dec 21, 2021
* combine virtualservices into one

* fix return vsvcs

* rename vsvc

* update istio cleaner to only delete with matching uid

* add istio mesh docs

* add images

* remove azure metrics file

* remove azure metrics file
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better support Istio Mesh Internal to cluster
5 participants