Liveness probe kills seldon engine container when model predict function takes a long time to send a result #753

Closed
pisymbol opened this issue Aug 5, 2019 · 20 comments


pisymbol commented Aug 5, 2019

Seldon-Core: 0.3.0
Kubernetes Client Version: 1.15
Kubernetes Server Version: 1.13

We have a long segmentation pipeline for our model: a color image gets segmented into hundreds of images, some DL is performed on each, and the results are reassembled and finally returned.

The whole pipeline works until the very end, when we return the largish image (4032, 3024, 3) over the wire. The transfer probably takes around 1-2 minutes through my VPN connection. While that is happening, the liveness probe kills the container like so:

2019-08-05 20:18:55.505  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-05 20:18:55.505  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-05 20:18:55.505  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-05 20:18:55.505  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failing localhost:9000
2019-08-05 20:18:55.506  INFO 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : aicv-pmg-graph not ready!
2019-08-05 20:18:55.801  WARN 1 --- [nio-8082-exec-9] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-05 20:18:56.801  WARN 1 --- [io-8082-exec-10] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-05 20:18:57.801  WARN 1 --- [nio-8082-exec-1] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-05 20:18:58.587  WARN 1 --- [nio-8082-exec-2] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-05 20:18:58.801  WARN 1 --- [nio-8082-exec-3] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-05 20:18:59.801  WARN 1 --- [nio-8082-exec-4] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true

I can literally watch these messages appear as Seldon is serializing the image and sending it over the wire. As a result, the client receives a 502 response to its request.

How can I prevent this from happening? This is very similar to issue #674, but it happens at the end of predict.
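
For scale, the raw array alone is roughly 36.6 MB (4032 × 3024 × 3 bytes, assuming one byte per channel), and the JSON-serialized form that the engine has to buffer and forward over the wire will likely be several times larger than that.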

@pisymbol changed the title from "Liveness probe kills seldon engine container when model predict function takes a long time to produce a result" to "Liveness probe kills seldon engine container when model predict function takes a long time to send a result" on Aug 5, 2019
axsaucedo (Contributor) commented:

Interesting. This would actually get fixed once #684 is merged (as a quick fix), as that PR will enable multiple workers, which will be able to respond to the liveness probe while the longer request is being processed. In the meantime you can configure your own liveness probes with longer timings. Have you tried extending the liveness probe times?
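
For reference, a container liveness probe with extended timings might look like this in the pod spec. This is a minimal sketch: the timing values are illustrative only, and the /live path on the named admin port is an assumption based on the engine's default probe:

"livenessProbe": {
  "httpGet": {
    "path": "/live",
    "port": "admin"
  },
  "initialDelaySeconds": 60,
  "timeoutSeconds": 30,
  "periodSeconds": 10,
  "failureThreshold": 10
}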

pisymbol (Author) commented Aug 6, 2019

Alright, I configured my liveness probe with an insanely large timeoutSeconds. No dice. It still fails like clockwork:

2019-08-06 01:02:59.838  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:02:59.839  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:02:59.839  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:02:59.839  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failing localhost:9000
2019-08-06 01:02:59.839  INFO 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : aicv-pmg-graph not ready!
2019-08-06 01:03:00.612  WARN 1 --- [nio-8082-exec-8] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:01.612  WARN 1 --- [nio-8082-exec-9] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:02.612  WARN 1 --- [io-8082-exec-10] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:02.889  WARN 1 --- [nio-8082-exec-1] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:03.612  WARN 1 --- [nio-8082-exec-2] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:04.612  WARN 1 --- [nio-8082-exec-3] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:04.840  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:03:04.840  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:03:04.840  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:03:04.840  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failing localhost:9000
2019-08-06 01:03:04.840  INFO 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : aicv-pmg-graph not ready!
2019-08-06 01:03:05.612  WARN 1 --- [nio-8082-exec-4] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:06.612  WARN 1 --- [nio-8082-exec-5] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:07.612  WARN 1 --- [nio-8082-exec-6] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:07.889  WARN 1 --- [nio-8082-exec-7] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:08.612  WARN 1 --- [nio-8082-exec-8] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:09.612  WARN 1 --- [nio-8082-exec-9] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:09.841  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:03:09.841  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:03:09.841  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failed to connect to localhost:9000
2019-08-06 01:03:09.841  WARN 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : Failing localhost:9000
2019-08-06 01:03:09.841  INFO 1 --- [pool-2-thread-1] i.s.e.api.rest.SeldonGraphReadyChecker   : aicv-pmg-graph not ready!
2019-08-06 01:03:10.612  WARN 1 --- [io-8082-exec-10] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:11.612  WARN 1 --- [nio-8082-exec-1] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:12.612  WARN 1 --- [nio-8082-exec-2] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:12.889  WARN 1 --- [nio-8082-exec-3] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller true
2019-08-06 01:03:12.962  WARN 1 --- [nio-8000-exec-1] i.s.e.api.rest.RestClientController      : App Paused
2019-08-06 01:03:13.612  WARN 1 --- [nio-8082-exec-4] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller false
2019-08-06 01:03:14.612  WARN 1 --- [nio-8082-exec-5] i.s.e.api.rest.RestClientController      : Not ready graph checker false, controller false
2019-08-06 01:03:15.612  WARN 1 --- [nio-8082-exec-6] i.s.e.api.rest.RestClientController      : Not ready graph checker true, controller false
2019-08-06 01:03:16.612  WARN 1 --- [nio-8082-exec-7] i.s.e.api.rest.RestClientController      : Not ready graph checker true, controller false
2019-08-06 01:03:17.612  WARN 1 --- [nio-8082-exec-8] i.s.e.api.rest.RestClientController      : Not ready graph checker true, controller false
2019-08-06 01:03:18.612  WARN 1 --- [nio-8082-exec-9] i.s.e.api.rest.RestClientController      : Not ready graph checker true, controller false
2019-08-06 01:03:19.612  WARN 1 --- [io-8082-exec-10] i.s.e.api.rest.RestClientController      : Not ready graph checker true, controller false
2019-08-06 01:03:20.612  WARN 1 --- [nio-8082-exec-1] i.s.e.api.rest.RestClientController      : Not ready graph checker true, controller false
2019-08-06 01:03:21.612  WARN 1 --- [nio-8082-exec-2] i.s.e.api.rest.RestClientController      : Not ready graph checker true, controller false
2019-08-06 01:03:22.612  WARN 1 --- [nio-8082-exec-3] i.s.e.api.rest.RestClientController      : Not ready graph checker true, controller false
2019-08-06 01:03:23.028  INFO 1 --- [       Thread-7] io.seldon.engine.App$GracefulShutdown    : Starting graceful shutdown of Tomcat

Note that my model is REST-based.

pisymbol (Author) commented Aug 6, 2019

The other strange thing is that it always happens at the EXACT same time no matter what I set my liveness probe to. What is going on here?

ryandawsonuk (Contributor) commented:

I think it's the engine liveness probe failing rather than the one on your model, just like in #674 (comment). Could you try again using the latest snapshot version, 0.3.2-SNAPSHOT?

pisymbol (Author) commented Aug 6, 2019

I agree, @ryandawsonuk, but I don't know how to change the engine's liveness/readiness probe (if that is even possible). Let me see if I can get the SNAPSHOT installed. Is there any workaround to this issue, though? It is preventing ALL of our models from being deployed under Seldon right now.

ukclivecox (Contributor) commented:

Have you tried with the latest master install of the seldon-operator? The liveness probe fix that @ryandawsonuk mentioned should resolve your issue.

pisymbol (Author) commented Aug 6, 2019

Just tried 0.3.2-SNAPSHOT, same thing.

Are you saying I should try master of everything, or just the seldon-operator?

ukclivecox (Contributor) commented:

Just the seldon-operator should be good enough. Can you check the yaml of your deployment and see if the liveness probe for the engine points to /live?
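
For example, something like this will surface the probe configuration from the live deployment (the deployment name is a placeholder):

$ kubectl get deployment <your-seldon-deployment> -o yaml | grep -A8 livenessProbe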

pisymbol (Author) commented Aug 6, 2019

@cliveseldon On 0.3.2-SNAPSHOT:

$ k describe pod/blah | grep -i Liveness
    Liveness:   tcp-socket :http delay=60s timeout=1s period=5s #success=1 #failure=3
    Liveness:   http-get http://:admin/live delay=20s timeout=2s period=5s #success=1 #failure=7


pisymbol (Author) commented Aug 6, 2019

Another question: is there a way to just shut off the seldon-container-engine liveness probe completely, or fake it, as a temporary workaround?

ukclivecox (Contributor) commented:

Not at present.

So it's suggesting that the Java code is not able to respond to the liveness probe within its 2-second timeout while processing the return of the image. Are you sure the Java process has enough memory and isn't garbage collecting or otherwise stalled during this time? Have you set the engine resource request to a high value, and does your cluster node have enough memory?
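
One quick way to check whether the engine container is running close to its memory allocation is per-container usage while a request is in flight (a sketch; this assumes metrics-server is installed, and the pod name is a placeholder):

$ kubectl top pod <your-pod> --containers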

pisymbol (Author) commented Aug 6, 2019

If Java were running out of memory, wouldn't we see that in the Spring logs? The pod's node has 64GB of RAM; that's plenty. I've limited my pod to 4GB. I can go up to 8GB if you think that will help?

ukclivecox (Contributor) commented:

It's more about adding more resources to the engine.

pisymbol (Author) commented Aug 6, 2019

Which engine? Do you mean my pod or the seldon-container-engine? I think you mean the latter.

Both are running on a host with 64GB of RAM. Are you talking about giving java more memory?

How do I do that exactly?

Here is the java command you guys run:

java -Djava.security.egd=file:/dev/./urandom -Dcom.sun.management.jmxremote.rmi.port=9090 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9090 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=127.0.0.1 -jar app.jar

One thing I suggest is running it in '-server' mode or setting higher defaults for -Xms and maybe -Xmx.
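
For example, a sketch of the same command with an explicit server VM and heap bounds (heap sizes are illustrative, not a tested recommendation; the JMX flags above are unchanged and omitted here for brevity):

java -server -Xms1g -Xmx4g -Djava.security.egd=file:/dev/./urandom ... -jar app.jar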

I was reading issue #398, but the doc link there goes to a 404. How can I allocate more resources to the seldon-container-engine sidecar?

ukclivecox (Contributor) commented:

You can add a svcOrchSpec section, e.g.:

{
  "apiVersion": "machinelearning.seldon.io/v1alpha2",
  "kind": "SeldonDeployment",
  "metadata": {
    "labels": {
      "app": "seldon"
    },
    "name": "release-name"
  },
  "spec": {
    "annotations": {
      "seldon.io/engine-log-message-type": "seldon.message.pair",
      "seldon.io/engine-log-messages-externally": "false",
      "seldon.io/engine-log-requests": "false",
      "seldon.io/engine-log-responses": "false",
      "seldon.io/headless-svc": "false"
    },
    "name": "release-name",
    "predictors": [
      {
        "componentSpecs": [
          {
            "spec": {
              "containers": [
                {
                  "image": "seldonio/mock_classifier:1.0",
                  "imagePullPolicy": "IfNotPresent",
                  "name": "classifier",
                  "resources": {
                    "requests": {
                      "memory": "1Mi"
                    }
                  },
                  "env": [
                    {
                      "name": "LOG_LEVEL",
                      "value": "INFO"
                    }
                  ]
                }
              ],
              "terminationGracePeriodSeconds": 1
            }
          }
        ],
        "graph": {
          "children": [],
          "name": "classifier",
          "type": "MODEL",
          "endpoint": {
            "type": "REST"
          }
        },
        "svcOrchSpec": {
          "resources": {
            "requests": {
              "cpu": "0.1"
            }
          },
          "env": []
        },
        "name": "release-name",
        "replicas": 1,
        "labels": {
          "fluentd": "true",
          "version": "v1"
        }
      }
    ]
  }
}
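
For memory specifically, the svcOrchSpec resources block follows the standard Kubernetes ResourceRequirements schema, so a sketch like the following (sizes illustrative, not a recommendation) would raise the engine's memory request and cap:

"svcOrchSpec": {
  "resources": {
    "requests": {
      "cpu": "0.1",
      "memory": "4Gi"
    },
    "limits": {
      "memory": "8Gi"
    }
  }
}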

pisymbol (Author) commented Aug 6, 2019

OK, slightly different now: the readiness probe failed, not the liveness one. That was with 4Gi. I'm going to double it and try again (if possible; I'm talking to our admin about getting more resources).

pisymbol (Author) commented Aug 6, 2019

Eureka! 8Gi worked!!!!!!!

Alright, so how can this error be diagnosed better in the future? And is there anything future model deployers should know? This really hit me out of left field, without a lot to go on.

ukclivecox (Contributor) commented:

Great!

Some metrics in a k8s dashboard showing that the Java engine is thrashing or garbage collecting would be ideal. We should be able to add engine memory and CPU metrics, or something else that shows the pod is under heavy load.
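
In the meantime, the java command quoted above already enables remote JMX on port 9090 with SSL and authentication disabled, so one way to watch heap and GC behaviour live is to port-forward to the pod and attach a standard JMX console (pod name is a placeholder):

$ kubectl port-forward <your-pod> 9090:9090
$ jconsole localhost:9090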

pisymbol (Author) commented Aug 6, 2019

Thanks again Clive et al.

I would have given you a few bucks for just a warning or alert message from the sidecar container that memory was low.

What I do suggest is that in future releases deployers get more control over the seldon-container-engine sidecar resources, etc.

ryandawsonuk (Contributor) commented:

I have created #761 as a follow-up action.
