
Quickstart broken: modelmesh-serving svc not created in --namespace-scope-mode, ISVCs never get to READY state #433

Closed
ckadner opened this issue Sep 20, 2023 · 7 comments · Fixed by #435
Labels: bug (Something isn't working), regression

ckadner (Member) commented Sep 20, 2023

Describe the bug

The quickstart install instructions no longer work correctly. After deploying a model, the InferenceService never reaches the READY state, and inference requests cannot be performed.

To Reproduce

  1. Follow the Quickstart guide to the point of deploying a model and wait for the runtime pods to be Running (a reconstruction of the deployed InferenceService manifest follows these steps).
$ kubectl get pods

NAME                                              READY   STATUS    RESTARTS   AGE
etcd-558fb445c-ht596                              1/1     Running   0          9h
minio-774d48cf7-nbkd7                             1/1     Running   0          9h
modelmesh-controller-86dd54cc54-xhnvv             1/1     Running   0          9h
modelmesh-serving-mlserver-1.x-647d4f8f9c-h5b49   4/4     Running   0          26m
modelmesh-serving-mlserver-1.x-647d4f8f9c-hv6ht   4/4     Running   0          26m
  2. Wait a few minutes (or hours) and check the status of the newly created InferenceService.
$ kubectl get isvc

NAME                   URL   READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
example-sklearn-isvc                                                                               26m
  3. Describe the ISVC; it shows no events:
$ kubectl describe isvc example-sklearn-isvc

Name:         example-sklearn-isvc
Namespace:    modelmesh-serving
Labels:       <none>
Annotations:  serving.kserve.io/deploymentMode: ModelMesh
              serving.kserve.io/secretKey: localMinIO
API Version:  serving.kserve.io/v1beta1
Kind:         InferenceService
Metadata:
  Creation Timestamp:  2023-09-20T03:21:13Z
  Generation:          2
  Managed Fields:
    API Version:  serving.kserve.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
          f:serving.kserve.io/deploymentMode:
          f:serving.kserve.io/secretKey:
      f:spec:
        .:
        f:predictor:
          .:
          f:model:
            .:
            f:modelFormat:
              .:
              f:name:
            f:storageUri:
    Manager:         kubectl-client-side-apply
    Operation:       Update
    Time:            2023-09-20T03:46:54Z
  Resource Version:  61292
  UID:               2d2acd78-e0ce-4864-bd3f-54944d02ae85
Spec:
  Predictor:
    Model:
      Model Format:
        Name:       sklearn
      Storage Uri:  s3://modelmesh-example-models/sklearn/mnist-svm.joblib
Events:             <none>

  4. Ignore all that for a minute and go on to create the port-forward:
$ kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8033 -n modelmesh-serving

Error from server (NotFound): services "modelmesh-serving" not found
  5. Check the service list:
$ kubectl get svc

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
etcd                               ClusterIP   172.21.141.59    <none>        2379/TCP   9h
minio                              ClusterIP   172.21.11.102    <none>        9000/TCP   9h
modelmesh-webhook-server-service   ClusterIP   172.21.232.100   <none>        9443/TCP   9h
  • Note: the service modelmesh-serving does not exist.
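
For reference, the deployed InferenceService can be reconstructed from the describe output above — a sketch; the heredoc form is mine, the field values are verbatim from that output:

$ kubectl apply -n modelmesh-serving -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
    serving.kserve.io/secretKey: localMinIO
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://modelmesh-example-models/sklearn/mnist-svm.joblib
EOF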

Expected behavior

Up until release v0.11.0, following the instructions in the Quickstart guide resulted in the InferenceService being created successfully, the port-forward working, and inference requests returning the expected response.
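
For context, the kind of inference request the quickstart issues through the port-forward — a sketch assuming grpcurl and the KServe v2 gRPC inference protocol; the proto path and request body are illustrative, not copied from the guide:

# query model metadata over the forwarded gRPC port (assumes the v2 inference proto)
$ grpcurl -plaintext -proto fvt/proto/kfs_inference_v2.proto \
    -d '{"name": "example-sklearn-isvc"}' \
    localhost:8033 inference.GRPCInferenceService.ModelMetadata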

Additional context

How do the FVT test cases pass? Curiously, when installing per the FVT instructions with the --fvt flag (a cluster-scoped install), the modelmesh-serving service does get created and the ISVC transitions to the READY state (see the install sketch after the service listing below).

$ kubectl get svc

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
etcd                               ClusterIP   172.21.33.247    <none>        2379/TCP                     2m7s
minio                              ClusterIP   172.21.215.204   <none>        9000/TCP                     2m6s
modelmesh-serving                  ClusterIP   None             <none>        8033/TCP,8008/TCP,2112/TCP   75s
modelmesh-webhook-server-service   ClusterIP   172.21.82.140    <none>        9443/TCP                     104s
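
For comparison, a sketch of the cluster-scoped FVT-style install that produced the listing above, assuming the --fvt flag mentioned earlier; exact flags may differ from the FVT docs:

# cluster-scoped quickstart install used by the FVTs (flags assumed from this thread)
$ kubectl create namespace modelmesh-serving
$ ./scripts/install.sh --namespace modelmesh-serving --fvt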

Environment (please complete the following information):

  • OS: macOS client side, Ubuntu 20 server side
  • Browser: [n/a]
  • Version: ModelMesh Serving latest of main; Kubernetes 1.25, 1.26, 1.27

UPDATE 2023-09-21:

FYI @Jooho

I went through some of the recent changes and it looks like PR #397 introduced this change in behavior. Could it be that your changes to how namespaces are reconciled affect how/when/whether the modelmesh-serving service gets created?

W0921   1 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: 
  failed to list *v1.Namespace: namespaces is forbidden: 
    User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list 
      resource "namespaces" in API group "" at the cluster scope

E0921   1 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169:
  Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden:
    User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
      resource "namespaces" in API group "" at the cluster scope
ckadner added the bug label on Sep 20, 2023
ckadner (Member Author) commented Sep 20, 2023

@rafvasq just pointed out that in cluster-scope mode (FVT) it did work.

ckadner self-assigned this on Sep 20, 2023
ckadner changed the title from "Quickstart install broken: modelmesh-serving service not created, ISVCs never get to READY state" to "Quickstart broken: modelmesh-serving svc not created in --namespace-scope-mode, ISVCs never get to READY state" on Sep 21, 2023

ckadner (Member Author) commented Sep 21, 2023

I tagged some recent commits on main and triggered image builds using GitHub Actions.

git tag -a pr-422 0b39182 -m "PR #422"
git push upstream pr-422

git tag -a pr-397 dd7277a -m "PR #397"
git push upstream pr-397

git tag -a pr-428 f47cd7b -m "PR #428"
git push upstream pr-428

The new pr-xxx images on Dockerhub make it possible to "go back in time" and deploy the modelmesh-controller exactly as it was right after a given PR was merged (a sketch of the image-tag swap follows).
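
The tag swap referenced in the comments below ("replace newTag for controller image") can be done with a one-liner — a sketch assuming GNU sed; the exact editing method is not from the original thread:

# point the controller image at a pr-xxx tag (GNU sed; on macOS use: sed -i '' ...)
$ sed -i 's/newTag:.*/newTag: pr-428/' config/manager/kustomization.yaml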

@ckadner
Copy link
Member Author

ckadner commented Sep 21, 2023

Before PR #397

$ git checkout pr-428

HEAD is now at f47cd7b fix: Support devel version of Kustomize in install script (#428)

# replace newTag for controller image
$ git diff | grep -E "^\+"

+++ b/config/manager/kustomization.yaml
+    newTag: pr-428

$ kubectl create namespace modelmesh-serving
$ ./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart --enable-self-signed-ca

Successfully installed ModelMesh Serving!

$ kubectl get svc

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
etcd                               ClusterIP   172.21.171.73    <none>        2379/TCP                     2m7s
minio                              ClusterIP   172.21.190.108   <none>        9000/TCP                     2m7s
modelmesh-serving                  ClusterIP   None             <none>        8033/TCP,8008/TCP,2112/TCP   99s
modelmesh-webhook-server-service   ClusterIP   172.21.99.254    <none>        9443/TCP                     110s

After PR #397

$ git checkout pr-397

HEAD is now at dd7277a fix: Goroutine memory leak (#397)

# replace newTag for controller image
$ git diff | grep -E "^\+"

+++ b/config/manager/kustomization.yaml
+    newTag: pr-397

$ kubectl create namespace modelmesh-serving
$ ./scripts/install.sh --namespace-scope-mode --namespace modelmesh-serving --quickstart --enable-self-signed-ca

Successfully installed ModelMesh Serving!

$ kubectl get svc

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
etcd                               ClusterIP   172.21.219.47   <none>        2379/TCP   6m37s
minio                              ClusterIP   172.21.177.41   <none>        9000/TCP   6m36s
modelmesh-webhook-server-service   ClusterIP   172.21.249.37   <none>        9443/TCP   6m14s

The modelmesh-serving service is missing.
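
One way to confirm where the installs diverge is to grep the controller logs for the Kube Service reconcile message that only the working install emits (see the log comparison in the next comment); this probe is mine, not from the original report:

# matches "Updated Kube Service" only in the pre-#397 (working) install
$ kubectl logs deployment/modelmesh-controller -n modelmesh-serving | grep "Updated Kube Service"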

ckadner (Member Author) commented Sep 21, 2023

Some differences in the controller logs, before and after PR #397

  • in order of appearance in logs
  • timestamps removed
  • D = DEBUG, I = INFO
  • indentation changed for readability

Unique log messages before PR #397 (no longer appear after PR #397):

D   KubeResolver            Built new resolver                 {"target": {"Scheme":"kube","Authority":"","URL":{"Scheme":"kube","Opaque":"","User":null,"Host":"","Path":"/modelmesh-serving.modelmesh-serving:8033","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}, "name": "modelmesh-serving/modelmesh-serving"}
D   KubeResolver            Failed to update resolver due to bad state, requeuing endpoint reconciliation
D   KubeResolver.Reconcile  Ignoring event for Endpoints with no resolver   {"endpoints": "modelmesh-serving/modelmesh-serving"}
D   ModelMeshEventStream    Etcd config secret changed. Creating a new etcd client and restarting watchers. {"oldSecretName": "", "newSecretName": "model-serving-etcd"}
D   ModelMeshEventStream    ModelMesh Model Event              {"namespace": "modelmesh-serving", "modelId": "", "event": "INITIALIZED"}
D   ModelMeshEventStream    ModelMesh VModel Event             {"namespace": "modelmesh-serving", "vModelId": "", "event": "INITIALIZED"}
I   MMService               Established new MM gRPC connection {"namespace": "modelmesh-serving", "endpoint": "kube:///modelmesh-serving.modelmesh-serving:8033", "TLS": false}
I   ModelMeshEventStream    Initialize Model Event Stream      {"namespace": "modelmesh-serving", "servicePrefix": "modelmesh-serving/mm/modelmesh-serving"}
I   ModelMeshEventStream    EtcdRangeWatcher starting          {"namespace": "modelmesh-serving", "WatchPrefix": "modelmesh-serving/mm/modelmesh-serving/registry/"}
I   ModelMeshEventStream    EtcdRangeWatcher starting          {"namespace": "modelmesh-serving", "WatchPrefix": "modelmesh-serving/mm/modelmesh-serving/vmodels/"}
I   controllers.Service     Updated Kube Service               {"namespace": "modelmesh-serving", "name": "modelmesh-serving", }

Unique log messages after PR #397 (did not appear before PR #397):

E0921   1 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169:
  Failed to watch *v1.Namespace: failed to list *v1.Namespace: namespaces is forbidden:
    User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
      resource "namespaces" in API group "" at the cluster scope

W0921   1 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169:
  failed to list *v1.Namespace: namespaces is forbidden:
    User "system:serviceaccount:modelmesh-serving:modelmesh-controller" cannot list
      resource "namespaces" in API group "" at the cluster scope

Log messages that changed after PR #397:
  • before PR #397:

    I0921   1 request.go:690] Waited for 1s due to client-side throttling, not priority and fairness, request:
      GET:https://172.21.0.1:443/apis/coordination.k8s.io/v1?timeout=32s
    
  • after PR #397:

    I0921   1 request.go:690] Waited for 1s due to client-side throttling, not priority and fairness, request:
      GET:https://172.21.0.1:443/apis/node.k8s.io/v1?timeout=32s
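
For reproducibility, a sketch of how such a comparison can be captured — the file names and the diff step are illustrative, not from the original report:

# capture the controller logs from each install into local files, then compare
$ kubectl logs deployment/modelmesh-controller -n modelmesh-serving > controller-pr-397.log
$ diff controller-pr-428.log controller-pr-397.log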
    

ckadner (Member Author) commented Sep 21, 2023

FYI @Jooho @tjohnson31415

ckadner (Member Author) commented Sep 27, 2023

We should create a separate FVT workflow to test namespace-scope mode, running in parallel with the existing cluster-scope FVT (a sketch of the install step follows).
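
A sketch of the install step such a workflow could run — assuming install.sh accepts --namespace-scope-mode together with --fvt; that flag combination is an assumption, not verified in this thread:

# hypothetical namespace-scoped FVT install (flag combination unverified)
$ kubectl create namespace modelmesh-serving
$ ./scripts/install.sh --namespace modelmesh-serving --namespace-scope-mode --fvt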
