Deploying custom MLflow model - stuck at "Readiness probe failed" #3186

Closed
FilipVel opened this issue May 12, 2021 · 11 comments

@FilipVel

FilipVel commented May 12, 2021

Hey, so I have a problem deploying a custom MLflow model made with `mlflow.pyfunc.model`.

The model that I have is as follows:

class CarModel(mlflow.pyfunc.PythonModel):
    def __init__(self,model,std_scaler,pw_trans,imputer,encoders,preprocess_data):
        self.model = model
        self.std_scaler = std_scaler
        self.pw_trans = pw_trans
        self.imputer = imputer
        self.encoders = encoders
        self.preprocess_data = preprocess_data
    def predict(self, context, model_input):
        processed = self.preprocess_data(df = model_input,
                                        std_scaler = self.std_scaler,
                                        pw_trans = self.pw_trans,
                                        imputer = self.imputer,
                                        encoders = self.encoders)
        return self.model.predict(processed)
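Independent of Seldon, the preprocess-then-predict wrapper pattern above can be sanity-checked in plain Python with stand-in components (a sketch only; `StubScaler`, `StubModel` and the simplified `preprocess_data` are placeholders, not the real pipeline objects):

```python
# Sketch of the wrapper pattern used by CarModel, with stub components
# so it runs without the real scaler/transformer/model objects.

class StubScaler:
    def transform(self, rows):
        # Pretend scaling: halve every value.
        return [[x * 0.5 for x in row] for row in rows]

class StubModel:
    def predict(self, rows):
        # Pretend model: sum each row.
        return [sum(row) for row in rows]

def preprocess_data(df, std_scaler):
    # The real preprocess_data also applies a power transform, imputer
    # and encoders; here we only scale, to keep the sketch minimal.
    return std_scaler.transform(df)

class CarModelSketch:
    def __init__(self, model, std_scaler, preprocess_data):
        self.model = model
        self.std_scaler = std_scaler
        self.preprocess_data = preprocess_data

    def predict(self, context, model_input):
        processed = self.preprocess_data(df=model_input,
                                         std_scaler=self.std_scaler)
        return self.model.predict(processed)

wrapper = CarModelSketch(StubModel(), StubScaler(), preprocess_data)
print(wrapper.predict(None, [[2.0, 4.0], [6.0, 8.0]]))  # [3.0, 7.0]
```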

Apart from the model I have the MLmodel file with the following content:

flavors:
  python_function:
    cloudpickle_version: 1.5.0
    env: conda.yaml
    loader_module: mlflow.pyfunc.model
    python_model: python_model.pkl
    python_version: 3.8.3
utc_time_created: '2021-05-12 09:08:02.201931'

and conda.yaml

channels:
- defaults
- conda-forge
dependencies:
- python=3.8.3
- pip
- pip:
  - mlflow
  - xgboost==1.3.3
  - cloudpickle==1.5.0
  - scikit-learn==0.24.1
name: mlflow-env

When I run the model locally with `mlflow models serve -m model_name`, it works just fine.
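(For reference, a local smoke test of a served pyfunc model POSTs to the server's `/invocations` endpoint. The sketch below only builds the dataframe-split JSON payload accepted by MLflow 1.x scoring servers; the column names are hypothetical and no request is actually sent.)

```python
import json

# Payload in the "dataframe split" orientation that the MLflow 1.x
# pyfunc scoring server accepts on POST /invocations.
# Column names here are made up for illustration.
payload = {
    "columns": ["mileage", "age"],
    "data": [[120000, 7], [45000, 2]],
}
body = json.dumps(payload)
print(body)

# To actually send it (assuming the default local port 5000):
#   curl -X POST http://localhost:5000/invocations \
#        -H 'Content-Type: application/json' -d "$BODY"
```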

Next I uploaded the model with the conda.yaml and MLmodel files to a Google cloud bucket that I wanted to use as a source to build a seldon core deployment.
I tried deploying the model with the following code:

kubectl apply  -f - << END
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
  namespace: seldon
spec:
  name: car_predictor
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://car_app_model/try1
        name: classifier
      name: default
      replicas: 1
END

The pod then gets the following status:

seldon          mlflow-default-0-classifier-6f4cc6994-gjn67                 0/2     Running            1          2m59s

Next I run 'kubectl describe pod mlflow-default-0-classifier-6f4cc6994-gjn67 -n seldon' where I get the following events:

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  50s               default-scheduler  Successfully assigned seldon/mlflow-default-0-classifier-6f4cc6994-gjn67 to gke-cluster-1-default-pool-27b34b9a-dbqf
  Normal   Pulled     49s               kubelet            Container image "gcr.io/kfserving/storage-initializer:v0.4.0" already present on machine
  Normal   Created    49s               kubelet            Created container classifier-model-initializer
  Normal   Started    49s               kubelet            Started container classifier-model-initializer
  Normal   Pulled     38s               kubelet            Container image "seldonio/mlflowserver:1.7.0" already present on machine
  Normal   Created    38s               kubelet            Created container classifier
  Normal   Started    38s               kubelet            Started container classifier
  Normal   Pulled     38s               kubelet            Container image "docker.io/seldonio/seldon-core-executor:1.7.0" already present on machine
  Normal   Created    38s               kubelet            Created container seldon-container-engine
  Normal   Started    38s               kubelet            Started container seldon-container-engine
  Warning  Unhealthy  5s (x3 over 15s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  3s (x4 over 18s)  kubelet            Readiness probe failed: dial tcp 10.4.1.19:9000: connect: connection refused

After googling a bit I tried adding a readinessProbe field to the deployment file like this (note that I had to use `--validate=false`):

kubectl apply --validate=false -f - << END
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
  namespace: seldon
spec:
  name: car_predictor
  livenessProbe:
    initialDelaySeconds: 100
  readinessProbe:
    initialDelaySeconds: 100

  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://car_app_model/try1
        name: classifier
      name: default
      replicas: 1
END


Even after trying this the same error persists. What could be the problem?

@FilipVel added the `bug` and `triage` (needs to be triaged and prioritised accordingly) labels May 12, 2021
@ukclivecox
Contributor

Do you see any errors in the container logs?

@ukclivecox removed the `triage` label May 17, 2021
@FilipVel
Author

FilipVel commented May 18, 2021

The seldon-container-engine is showing this error:

{"level":"error","ts":1621340845.169119,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp 127.0.0.1:9000:  
connect: connection refused","stacktrace":"github.com/seldonio/seldon-core/executor/api/rest.(*SeldonRestApi).checkReady\n\t/workspace/api/rest/server.go:188\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/[email protected]/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/seldonio/seldon-core/executor/api/rest.  

(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2036\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2831\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1919"}

The classifier container logs get to this point:

---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
wheel-0.36.2         | 33 KB     | ########## | 100%
readline-8.1         | 362 KB    | ########## | 100%
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
pip-21.0.1           | 1.8 MB    | ########## | 100%
ld_impl_linux-64-2.3 | 568 KB    | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
openssl-1.1.1k       | 2.5 MB    | ########## | 100%
sqlite-3.35.4        | 981 KB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
certifi-2020.12.5    | 141 KB    | ########## | 100%
setuptools-52.0.0    | 714 KB    | ########## | 100%
python-3.8.3         | 49.1 MB   | ########## | 100%
libffi-3.3           | 50 KB     | ########## | 100%
ca-certificates-2021 | 114 KB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Installing pip dependencies: ...working...

and then getting back to

---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml

I also tried with the following deployment file:

kubectl apply  -f - << END
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
  namespace: seldon
spec:
  name: car_predictor
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://car_app_model/try1
        name: classifier
      name: default
      replicas: 1
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                readinessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 100
                  periodSeconds: 30
                  successThreshold: 1
                  timeoutSeconds: 3
END

but this way the pod didn't show up at all.

@Sheldelraze

Did you try disabling the Istio sidecar like this? I ran into this liveness probe problem too, although in my case it was a 403, not a 503.

@sleebapaul

@FilipVel Were you able to solve it? I'm stuck at the same issue.

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: rf-regressor
  namespace: seldon-system
spec:
  annotations:
    seldon.io/executor: "false" 
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: s3://models/rf-regressor  # note: s3 points to minio-seldon in the local kind cluster
        envSecretRefName: seldon-rclone-secret
        name: rf-regressor
      name: default
      replicas: 1
      componentSpecs:
       - spec:
          containers:
          - name: rf-regressor
            livenessProbe:
              initialDelaySeconds: 150
              failureThreshold: 10
              periodSeconds: 50
              successThreshold: 1
              tcpSocket:
                    port: 9000
              # httpGet:
              #   path: /health/ping
              #   port: http
              #   scheme: HTTP
              timeoutSeconds: 3
            readinessProbe:
              initialDelaySeconds: 150
              failureThreshold: 10
              periodSeconds: 50
              successThreshold: 1
              tcpSocket:
                    port: 9000
              # httpGet:
              #   path: /health/ping
              #   port: http
              #   scheme: HTTP
              timeoutSeconds: 3

Conda YAML

channels:
- conda-forge
dependencies:
- python=3.9.6
- pip
- pip:
  - mlflow
  - cloudpickle==2.0.0
  - scikit-learn==1.0
name: mlflow-env

LOGS

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  14m                    default-scheduler  Successfully assigned seldon-system/rf-regressor-default-0-rf-regressor-8466bd7c74-6g8hm to docker-desktop
  Normal   Created    14m                    kubelet            Created container rf-regressor-model-initializer
  Normal   Started    14m                    kubelet            Started container rf-regressor-model-initializer
  Normal   Pulled     14m                    kubelet            Container image "seldonio/rclone-storage-initializer:1.11.2" already present on machine
  Normal   Started    14m                    kubelet            Started container rf-regressor
  Normal   Pulled     14m                    kubelet            Container image "seldonio/mlflowserver:1.11.2" already present on machine
  Normal   Created    14m                    kubelet            Created container rf-regressor
  Normal   Killing    14m                    kubelet            Container seldon-container-engine failed liveness probe, will be restarted
  Normal   Pulled     13m (x2 over 14m)      kubelet            Container image "docker.io/seldonio/engine:1.11.2" already present on machine
  Normal   Started    13m (x2 over 14m)      kubelet            Started container seldon-container-engine
  Normal   Created    13m (x2 over 14m)      kubelet            Created container seldon-container-engine
  Warning  Unhealthy  13m (x4 over 14m)      kubelet            Liveness probe failed: Get "https://10.1.0.31:8082/live": dial tcp 10.1.0.31:8082: connect: connection refused
  Warning  Unhealthy  9m34s (x40 over 14m)   kubelet            Readiness probe failed: Get "https://10.1.0.31:8082/ready": dial tcp 10.1.0.31:8082: connect: connection refused
  Warning  BackOff    4m26s (x22 over 9m8s)  kubelet            Back-off restarting failed container

@agrski
Contributor

agrski commented Oct 27, 2021

@sleebapaul there are a couple of things that might be worth checking.

The first is that your predictor spec mentions port 9000 but your logs are for port 8082 - that seems inconsistent? If you check the pod spec in Kubernetes, which container uses port 8082?

The second is that you might need to increase the initialDelaySeconds or failureThreshold, specifically for the livenessProbe. MLFlow pulls dependencies at runtime, so it can be really slow to start up, although waiting almost 11 minutes does seem very slow. Checking logs for the server container would help to confirm this.
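The worst-case wait before the kubelet gives up follows directly from the probe settings (roughly `initialDelaySeconds + periodSeconds * failureThreshold`); for the 150/50/10 values in the spec above:

```python
# Rough upper bound on how long a container can keep failing probes
# before the kubelet acts (restarts it for liveness; keeps it unready
# for readiness).

def probe_window(initial_delay_s, period_s, failure_threshold):
    return initial_delay_s + period_s * failure_threshold

# Settings from the liveness/readiness probes in the spec above:
print(probe_window(150, 50, 10))  # 650 seconds, i.e. ~10.8 minutes
```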

@sleebapaul

@agrski I've changed the YAML file a bit.

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
  namespace: seldon-system
spec:
  name: rf-regressor
  predictors:
  - componentSpecs:
    - spec:
        # We are setting high failureThreshold as installing conda dependencies
        # can take long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: regressor
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://models/rf-regressor
      envSecretRefName: seldon-rclone-secret
      name: regressor
    name: default
    replicas: 1

The regressor container is crashing because Conda fails to install from the /microservice/requirements.txt file.

INFO:root:Install additional package from requirements.txt
ERROR conda.cli.main_run:execute(34): Subprocess for 'conda run ['pip', 'install', '-r', '/microservice/requirements.txt']' command failed.  (See above for error)
WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/envs/mlflow/bin/python3.9 /opt/conda/envs/mlflow/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpl_sq464v
       cwd: /microservice/python
  Complete output (3 lines):
  running egg_info
  creating seldon_core.egg-info
  error: could not create 'seldon_core.egg-info': Permission denied
  ----------------------------------------
WARNING: Discarding file:///microservice/python. Command errored out with exit status 1: /opt/conda/envs/mlflow/bin/python3.9 /opt/conda/envs/mlflow/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpl_sq464v Check the logs for full command output.
ERROR: Command errored out with exit status 1: /opt/conda/envs/mlflow/bin/python3.9 /opt/conda/envs/mlflow/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpl_sq464v Check the logs for full command output.

Processing ./python
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'

Traceback (most recent call last):
  File "./conda_env_create.py", line 151, in <module>
    main(args)
  File "./conda_env_create.py", line 146, in main
    setup_env(model_folder)
  File "./conda_env_create.py", line 55, in setup_env
    install_base_reqs()
  File "./conda_env_create.py", line 136, in install_base_reqs
    run(cmd, shell=True, check=True)
  File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'conda run -n mlflow pip install -r /microservice/requirements.txt' returned non-zero exit status 1.
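The tail of that traceback is the generic `subprocess` failure path: `conda_env_create.py` invokes pip via `run(cmd, shell=True, check=True)`, so any non-zero pip exit (here, the permission error) surfaces as `CalledProcessError`. A minimal reproduction of that pattern:

```python
import subprocess

# With check=True, a non-zero exit status raises CalledProcessError,
# exactly as seen at the bottom of the conda_env_create.py traceback.
try:
    subprocess.run("exit 1", shell=True, check=True)
except subprocess.CalledProcessError as err:
    print(f"returned non-zero exit status {err.returncode}")  # 1
```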

@axsaucedo
Contributor

axsaucedo commented Oct 28, 2021

Regarding the readiness probe, it seems you have fixed it, but you can look at this example for reference: https://github.com/SeldonIO/seldon-core/blob/master/servers/mlflowserver/samples/elasticnet_wine.yaml

Regarding your second error: this was fixed recently via #3670, so you will have to use the dev images from master. You can install Seldon Core from master, which will configure all 1.12.0-dev images, by cloning the repo and running the Helm chart directly from the folder. The command below shows how you'd normally set it up with Istio:

helm upgrade --install seldon-core helm-charts/seldon-core-operator/ --namespace seldon-system --set istio.enabled="true" --set istio.gateway="seldon-gateway.istio-system.svc.cluster.local"

@sleebapaul

sleebapaul commented Oct 29, 2021

No luck gentlemen. @axsaucedo @agrski

Let me jot down every command I've used.

    curl -L https://istio.io/downloadIstio | sh -
    cd istio-1.11.4
    export PATH=$PWD/bin:$PATH
    istioctl install --set profile=minimal -y
   
    kubectl create namespace seldon
    kubectl config set-context $(kubectl config current-context) --namespace=seldon

    kubectl create namespace seldon-system
    
    helm install seldon-core seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --set istio.enabled=true --set istio.gateway="seldon-gateway.istio-system.svc.cluster.local" --set usageMetrics.enabled=true --namespace seldon-system

    kubectl rollout status deploy/seldon-controller-manager -n seldon-system

    helm install seldon-core-analytics seldon-core-analytics --namespace seldon-system --repo https://storage.googleapis.com/seldon-charts --set grafana.adminPassword=password --set grafana.adminUser=admin
   
    git clone https://github.com/SeldonIO/seldon-core/
    cd seldon-core

    helm upgrade --install seldon-core helm-charts/seldon-core-operator/ --namespace seldon-system --set istio.enabled="true" --set istio.gateway="seldon-gateway.istio-system.svc.cluster.local" --set ambassador.enabled="true"

    kubectl create ns minio-system 
    
    helm repo add minio https://helm.min.io/
    helm install minio minio/minio --set accessKey=minioadmin \
    --set secretKey=minioadmin --namespace minio-system

    kubectl describe pods --namespace minio-system

    export POD_NAME=$(kubectl get pods --namespace minio-system -l "release=minio" -o jsonpath="{.items[0].metadata.name}")

    kubectl port-forward $POD_NAME 9000 --namespace minio-system

    mc config host add minio-local http://localhost:9000 minioadmin minioadmin

    mc rb --force minio-local/models
    mc mb minio-local/models
    mc cp -r experiments/buckets/mlflow/0/<experiment-id>/artifacts/ minio-local/models/

    kubectl apply -f seldon-rclone-secret.yaml
    kubectl apply -f deploy.yaml

ERROR

Traceback (most recent call last):
File "./conda_env_create.py", line 151, in <module>
  main(args)
File "./conda_env_create.py", line 146, in main
  setup_env(model_folder)
File "./conda_env_create.py", line 55, in setup_env
  install_base_reqs()
File "./conda_env_create.py", line 136, in install_base_reqs
  run(cmd, shell=True, check=True)
File "/opt/conda/lib/python3.7/subprocess.py", line 512, in run
  output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'conda run -n mlflow pip install -r /microservice/requirements.txt' returned non-zero exit status 1.

@sleebapaul

@axsaucedo Following up on the issue, any updates? I'm using Python 3.9. Is that a problem?

@sleebapaul

sleebapaul commented Nov 12, 2021

@axsaucedo @agrski The issue was caused by the Python version, as I suspected, so I downgraded to Python 3.7.12.

conda.yaml:
channels:
- conda-forge
dependencies:
- python=3.7.12
- pip
- pip:
  - mlflow
  - cloudpickle==2.0.0
  - scikit-learn==0.23.2
name: mlflow-env
deploy.yaml:
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: rf-regressor
  predictors:
  - componentSpecs:
    - spec:
        # We are setting high failureThreshold as installing conda dependencies
        # can take long time and we want to avoid k8s killing the container prematurely
        containers:
        - name: regressor
          livenessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
          readinessProbe:
            initialDelaySeconds: 80
            failureThreshold: 200
            periodSeconds: 5
            successThreshold: 1
            httpGet:
              path: /health/ping
              port: http
              scheme: HTTP
    graph:
      children: []
      implementation: MLFLOW_SERVER
      modelUri: s3://models/rf-regressor
      envSecretRefName: seldon-rclone-secret
      name: regressor
    name: default
    replicas: 1

Now I have the following issue: the process stops while copying the model. I've been trying for a while now to deploy a simple model using MLFLOW_SERVER with Seldon. Please let me know a working combination of Python, Seldon, and MLflow server versions to complete the task.

[Screenshot 2021-11-13 at 12 08 10 AM]

@axsaucedo
Contributor

Perfect. Closing, given that this has been answered; please reopen if this is still an issue.
