Question: Concurrency in Model REST server #453

Closed
NegatioN opened this issue Feb 20, 2019 · 5 comments · Fixed by #684

Comments

@NegatioN

I was wondering if you have any tests on how the REST API for MODEL type deployments handles concurrency?
I'm guessing this is the part of the code that handles it, so I would expect the Tornado coroutines to do something: https://github.com/SeldonIO/seldon-core/blob/master/python/seldon_core/model_microservice.py#L228-L262

However, when testing with ab to throw some concurrent traffic at the API, I see the time per request scale almost linearly with the concurrency level, whether the model is extremely simple (a simple one-layer model going from 100x1) or more complex (Embeddings, RNNs, DotProducts).

Example:
Note: This is currently with both the Docker server and ab running on the same machine, so I don't expect everything to fly, but I did expect some concurrency on a 4-core machine with hyperthreading.
The problem seems to be identical when running the server with the CLI seldon-core-microservice Model REST, although with less network overhead.

Concurrency 1

ab -p model_input -T "application/x-www-form-urlencoded" -c 1 -n 1000 http://0.0.0.0:5000/predict
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 0.0.0.0 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        Werkzeug/0.14.1
Server Hostname:        0.0.0.0
Server Port:            5000

Document Path:          /predict
Document Length:        167 bytes

Concurrency Level:      1
Time taken for tests:   28.145 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      346000 bytes
Total body sent:        235000
HTML transferred:       167000 bytes
Requests per second:    35.53 [#/sec] (mean)
Time per request:       28.145 [ms] (mean)
Time per request:       28.145 [ms] (mean, across all concurrent requests)
Transfer rate:          12.01 [Kbytes/sec] received
                        8.15 kb/s sent
                        20.16 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:    17   28  36.0     26    1039
Waiting:       17   28  36.0     26    1038
Total:         17   28  36.0     26    1039

Percentage of the requests served within a certain time (ms)
  50%     26
  66%     28
  75%     29
  80%     29
  90%     31
  95%     32
  98%     36
  99%     39
 100%   1039 (longest request)

Concurrency 5

ab -p model_input -T "application/x-www-form-urlencoded" -c 5 -n 1000 http://0.0.0.0:5000/predict
This is ApacheBench, Version 2.3 <$Revision: 1826891 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 0.0.0.0 (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:        Werkzeug/0.14.1
Server Hostname:        0.0.0.0
Server Port:            5000

Document Path:          /predict
Document Length:        167 bytes

Concurrency Level:      5
Time taken for tests:   27.359 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      346000 bytes
Total body sent:        235000
HTML transferred:       167000 bytes
Requests per second:    36.55 [#/sec] (mean)
Time per request:       136.797 [ms] (mean)
Time per request:       27.359 [ms] (mean, across all concurrent requests)
Transfer rate:          12.35 [Kbytes/sec] received
                        8.39 kb/s sent
                        20.74 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       0
Processing:    30  136  81.6    131    1207
Waiting:       29  122  72.2    118    1156
Total:         30  136  81.6    131    1207

Percentage of the requests served within a certain time (ms)
  50%    131
  66%    141
  75%    155
  80%    162
  90%    182
  95%    199
  98%    233
  99%    255
 100%   1207 (longest request)

Am I doing something wrong here, or is the REST API intended to scale at the pod level, not at the thread level?

Or are you intending for gRPC to be used in this case?
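
For reference, the two runs above can be compared with a little arithmetic. This is only a sketch using the totals ab printed. The helper name `summarize` is illustrative, and the results differ from ab's output in the last digits because ab works from an unrounded total time.

```python
# Totals copied from the two ab runs above.
runs = {
    1: {"total_s": 28.145, "requests": 1000},
    5: {"total_s": 27.359, "requests": 1000},
}


def summarize(concurrency, total_s, requests):
    """Derive ab's headline numbers from total time and request count."""
    rps = requests / total_s                      # "Requests per second"
    ms_across_all = total_s / requests * 1000     # "mean, across all concurrent requests"
    ms_per_request = ms_across_all * concurrency  # mean latency seen by each client
    return rps, ms_per_request, ms_across_all


for c, r in runs.items():
    rps, per_req, across = summarize(c, r["total_s"], r["requests"])
    print(f"c={c}: {rps:.2f} req/s, {per_req:.1f} ms per request, {across:.1f} ms across all")
```

Throughput is nearly the same in both runs (about 35.5 versus 36.6 requests per second) while per-request latency grows roughly fivefold, which is the pattern one would expect if requests are handled one at a time.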

Sorry for flooding your issues! :)

@NegatioN
Author

Right after posting I found #108 on the roadmap. I'm guessing that since we use Flask at the top level, we are effectively limited to a single thread?

https://github.com/SeldonIO/seldon-core/blob/master/python/seldon_core/model_microservice.py#L57

@ukclivecox
Contributor

Hi, yes. For the roadmap we have so far considered gunicorn or Tornado. We want to prioritise this work and would welcome your feedback. I see that SageMaker uses gunicorn. I think there should be a configurable number of threads started. We would welcome your thoughts.

@NegatioN
Author

I definitely like the intuitive options that are available when running TF-serving like:
--rest_api_num_threads which would be similar to the number of threads allowed for Tornado or Gunicorn here.

Beyond that I'm not sure I have much to add. I haven't looked at gunicorn or Tornado much. We have some internal java applications that solve things in a similar way with threadpools. It seems to be the norm, and having that kind of an option is very useful for us in a production setting at least.
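
To illustrate what a configurable thread count buys, here is a minimal sketch with the standard library's `ThreadPoolExecutor`. It is not seldon-core code: `predict`, `serve`, and `num_threads` are hypothetical names, and the 20 ms sleep stands in for model work that releases the GIL (I/O or native code). Pure-Python CPU-bound work would not speed up this way.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def predict(x):
    """Stand-in for a model call; sleeps to simulate 20 ms of non-Python work."""
    time.sleep(0.02)
    return x * 2


def serve(inputs, num_threads):
    """Handle all inputs with a fixed-size pool; return (results, seconds taken)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(predict, inputs))  # map preserves input order
    return results, time.perf_counter() - start


if __name__ == "__main__":
    inputs = list(range(20))
    for n in (1, 4):
        results, elapsed = serve(inputs, n)
        print(f"num_threads={n}: {elapsed:.2f}s for {len(results)} requests")
```

With one thread the simulated requests run back to back. With four, up to four of them overlap, so the total time drops accordingly.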

@tszumowski

Hi all. This thread is great timing! I was just playing around with some of the Seldon tutorials this morning and dove into the source, curious about how the REST endpoints were being served. I figured I'd chime in with some recent experience with gunicorn.

Regarding a server option, using just the built-in Flask webserver can be dangerous in deployment. Per the docs:

While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well. Some of the options available for properly running Flask in production are documented here.

On that page they list several servers: gunicorn, uWSGI, Tornado, etc. There are a lot of (opinionated) posts online about which is ideal (example, example), as well as some benchmarks for different servers (example, example).

Our engineering organization has used Tornado extensively in the past. More recently for ML deployment experimentation, I've been focused on gunicorn. This is for no reason other than convenience. Both AWS Sagemaker's scikit-learn example and MLFlow's server used it, and it seemed simple, so I figured "why not".

For gunicorn, I believe the power really comes from how the workers are defined, and how many workers you can allocate per pod on Kubernetes. Theoretically one can just let Kubernetes scale a single-worker server, but pod scaling is probably less reactive than worker management on a single pod.
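
As a rough sketch of what defining workers looks like, a gunicorn config file is plain Python. `bind`, `workers`, `threads`, and `worker_class` are gunicorn settings. The values, the `GUNICORN_WORKERS` environment variable, and the `myapp:app` module path are assumptions for illustration and not anything seldon-core defines.

```python
# gunicorn.conf.py (hypothetical file, for illustration only)
import multiprocessing
import os

# Same host and port as the ab runs earlier in this thread.
bind = "0.0.0.0:5000"

# Worker processes per pod. GUNICORN_WORKERS is a made-up variable name;
# the fallback follows gunicorn's suggested (2 x cores) + 1 starting point.
workers = int(os.environ.get("GUNICORN_WORKERS", multiprocessing.cpu_count() * 2 + 1))

# Threads per worker process, served by the threaded worker type.
threads = 2
worker_class = "gthread"
```

It would be started with something like `gunicorn -c gunicorn.conf.py myapp:app`. Keeping the worker count in an environment variable lets it be tuned per pod without rebuilding the image.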

For my tests, I was utilizing some Google cloud datastores within the deployed gunicorn service (prototype initially). It turned out the gevent worker had some incompatibilities (link, link) so that may be something to keep in mind too.

Regardless of the solution Seldon selects, it may be worth posting a disclaimer somewhere in the docs that the REST endpoint currently uses the Flask web server, which the Flask docs recommend against using in production.

@ukclivecox
Contributor

Thanks @tszumowski, very useful info and comments. I think we are probably also viewing gunicorn as the likely addition.

@ukclivecox ukclivecox modified the milestones: 0.2.x, 0.3.x Jun 3, 2019
agrski added a commit that referenced this issue Dec 2, 2022
* Add string method on server snapshot pointers

Print server names rather than memory addresses when working with lists of pointers, as in issue #449.

* Fix typo in log message

* Update function context field in Kafka consumer logger

* Use consistent var naming for Kafka config maps

* Create fresh config for Kafka admin client in model gateway

This avoids librdkafka complaining about irrelevant config options.
In fact, at the time of writing the version of confluent-kafka-go we use (v1.9.1) creates a producer under the hood,
as apparently this is slightly cheaper than a consumer.
In any case, we do not want to pass irrelevant config options which create noisy logs,
nor do we want to rely on knowledge of the Go-Kafka integration's implementation.

* Add producer config as context field instead of in-line in log message

* Convert producer config to JSON for use as log context field

* Add consumer config as JSON context field instead of inline in log message

* Use specific logger not general one in setup method

* Add tracing config as (JSON) context field in log message instead of as inline string

* Grammatical improvements/fixes

* Use debug level when logging event hub messages

* Formatting

Add blank lines for logical grouping.
Split conditions on long lines for legibility.
3 participants