Liveness probe kills seldon engine container when model predict function takes a long time to send a result #753
Comments
Interesting. This would actually get fixed once #684 is merged (as a quick fix), as that PR will enable multiple workers, which will be able to respond to the liveness probe while the longer request is processing. In the meantime you can configure your own liveness probes to allow a longer time - have you tried extending the liveness probe times?
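For reference, a minimal sketch of what extending the probe on the model container could look like - the container name `classifier`, the image, and port 9000 are placeholders, so adjust them to whatever your own deployment actually uses:

```yaml
# Hypothetical SeldonDeployment overriding the model container's liveness probe.
# Names, image, and port 9000 are illustrative only.
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  predictors:
  - name: default
    replicas: 1
    graph:
      name: classifier            # must match the container name below
      type: MODEL
    componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: my-org/my-model:0.1        # placeholder image
          livenessProbe:
            tcpSocket:
              port: 9000                    # placeholder port for the model server
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 10            # ~5 minutes of failures before a restart
```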
Alright, I configured my liveness probe with an insanely large timeoutSeconds. No dice. It just fails like clockwork:
Note my model is REST based.
The other strange thing is that it always happens at the EXACT same time no matter what I set my liveness probe to. What is going on here?
I think it's the engine liveness probe failing rather than the one on your model, just like in #674 (comment). Could you try again using the latest snapshot version, 0.3.2-SNAPSHOT?
I agree @ryandawsonuk, but I don't know how to change the liveness/readiness probe (if it is even possible). Let me see if I can get the SNAPSHOT installed. Is there any workaround to this issue, though? It is preventing ALL of our models from being deployed under Seldon right now.
Have you tried with the latest master install of the seldon-operator? The liveness probe fix that @ryandawsonuk mentioned should resolve your issue.
Just tried 0.3.2-SNAPSHOT, same thing. Are you saying I should try master of everything or just the seldon-operator?
Just the seldon-operator should be good enough. Can you check the yaml of your deployment and see what the liveness probe for the engine points to?
@cliveseldon On 0.3.2-SNAPSHOT:
Another question: is there a way to just shut off the seldon-container-engine liveness probe completely, or fake it as a temporary workaround?
Not at present. So it's suggesting the Java code is not able to respond to the liveness probe within a timeout of 2 secs while processing the return of the image. Are you sure the Java process has enough memory and is not garbage collecting or otherwise stalled during this time? Have you set the engine resource request to a high value, and does your cluster node have enough memory?
If Java was running out of memory, wouldn't we see that in the Spring logs? The pod's node has 64GB of RAM, that's plenty. I've limited my pod to 4GB; I can up it to 8GB if you think that will help?
It's more about adding more resources to the engine.
Which engine? Do you mean my pod or the seldon-container-engine? I think you mean the latter. Both are running on a host with 64GB of RAM. Are you talking about giving Java more memory? How do I do that exactly? Here is the java command you guys run:
One thing I suggest is to run it in '-server' mode or set higher defaults for -Xms and maybe -Xmx. I was reading Issue #398 but the doc link goes to a 404. How can I allocate more resources to the seldon-engine-container sidecar?
you can add a …
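As a rough sketch (not the exact snippet from this reply): Seldon Core documents an `svcOrchSpec` block on the predictor for setting engine resources and environment variables such as `JAVA_OPTS`. Something along these lines, assuming your version supports it (it may not be available in 0.3.0), with all values illustrative:

```yaml
# Hedged sketch: giving the engine/service-orchestrator more memory via svcOrchSpec.
# Field availability depends on the Seldon Core version; values are illustrative.
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  predictors:
  - name: default
    svcOrchSpec:
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 8Gi
      env:
      - name: JAVA_OPTS                 # assumed to be picked up by the engine JVM
        value: "-server -Xms2g -Xmx6g"
    graph:
      name: classifier
      type: MODEL
    componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: my-org/my-model:0.1    # placeholder
```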
Ok, slightly different now. The readiness probe failed, not the liveness one. That was with 4Gi. I'm going to double it and try again (if possible - talking to the admin about getting more resources).
Eureka! 8Gi worked!!!!!!! Alright, so how can this error be diagnosed a bit better in the future? And is there anything future model deployers should know? I mean, this really hit me out of left field without a lot to go on.
Great! Some metrics from a k8s dashboard that show the Java engine is thrashing/garbage collecting would be ideal. We should be able to add engine memory and CPU metrics, or something that shows the pod is under heavy load.
Thanks again Clive et al. I would have given you a few bucks for just a warning or alert message from the sidecar container that memory is low. What I do suggest is that in future releases deployers have more control over the seldon-container-engine sidecar resources etc.
Have created #761 as a follow-up action
Seldon-Core 0.3.0
Client Version: 1.15
Server Version: 1.13
What's happening is that we have a long pipeline for model segmentation (a color image gets segmented into hundreds of images, some DL is performed, then everything is reassembled and finally returned).
The whole pipeline works until the very end, when we are returning the largish image (4032, 3024, 3) over the wire. The network latency is probably around 1-2 minutes through my VPN connection. As a result, the liveness probe kills the container like so:
I can literally watch these messages appear as Seldon is serializing the image and sending it over the wire. As a result, the client sees a 502 response to its request.
How can I prevent this from happening? This is very similar to Issue #674 but happens at the end of predict.