Question: Concurrency in Model REST server #453
Comments
And right after posting I found this on the roadmap #108. I'm guessing that since we use Flask at the highest level, we are somewhat limited to a single effective thread?
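(For context, here is a minimal sketch, not Seldon's actual serving code, of the pattern in question: a Flask route wrapping a model call, served by the built-in development server. Even where the development server does handle requests in threads, CPU-bound Python inference is still largely serialised by the GIL, which is consistent with the linear scaling reported below.)

```python
# Minimal sketch (not Seldon's actual serving code) of a Flask route wrapping a
# model call, served by the built-in development server.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Stand-in for real inference. If this is CPU-bound Python, the GIL means
    # threads alone give little parallelism; libraries that release the GIL
    # (some NumPy/TensorFlow ops) can relax this, but process-based workers
    # are the more reliable route to concurrency.
    return [sum(features)]

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json(force=True)
    return jsonify({"prediction": predict(payload["data"])})

if __name__ == "__main__":
    # The development server: whether it handles requests in parallel depends on
    # its threading settings, and Flask advises against using it in production.
    app.run(host="0.0.0.0", port=5000, threaded=True)
```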
Hi, yes. For the roadmap we have presently considered gunicorn or Tornado. We want to prioritise this work and would welcome your feedback. I see that SageMaker uses gunicorn. I think there should be a configurable number of threads started; we would welcome your thoughts.
I definitely like the intuitive options that are available when running TF-Serving, like:

Beyond that, I'm not sure I have much to add. I haven't looked at gunicorn or Tornado much. We have some internal Java applications that solve things in a similar way with thread pools. It seems to be the norm, and having that kind of option is very useful for us, at least in a production setting.
Hi all. This thread is great timing! I was just playing around with some of the Seldon tutorials this morning and dove into the source, curious about how the REST endpoints were being served. I figured I'd chime in with some recent experience with gunicorn. Regarding a server option, using just the built-in Flask web server can be dangerous in deployment. Per the docs:
On that page they list several servers: gunicorn, uWSGI, Tornado, etc. There are a lot of (opinionated) posts online regarding which is ideal (example, example), and there are also some benchmarks out there for different servers (example, example).

Our engineering organization has used Tornado extensively in the past. More recently, for ML deployment experimentation, I've been focused on gunicorn, for no reason other than convenience: both AWS SageMaker's scikit-learn example and MLflow's server use it, and it seemed simple, so I figured "why not". For gunicorn, I believe the power really comes from how the workers are defined and how many workers you can allocate per pod on Kubernetes. Theoretically one can just let Kubernetes scale a single-worker server, but pod scaling is probably less reactive than worker management on a single pod. For my tests, I was utilizing some Google Cloud datastores within the deployed gunicorn service (initially a prototype). It turned out the […]

Regardless of the solution Seldon selects, it may be worth posting a disclaimer in the docs somewhere that the REST endpoint currently uses the built-in Flask web server, which Flask recommends against using in production.
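As a rough illustration of the kind of per-pod configurability being discussed, a gunicorn config file can expose worker and thread counts via environment variables. This is only a sketch; the app name `microservice:app` and the environment variable names are assumptions, not Seldon's actual setup.

```python
# gunicorn.conf.py — a rough sketch of configurable worker/thread counts for a
# Flask app exposed as `microservice:app` (names are assumptions, not Seldon's).
import multiprocessing
import os

# Worker processes sidestep the GIL; a common starting heuristic is 2*cores + 1.
workers = int(os.environ.get("GUNICORN_WORKERS", multiprocessing.cpu_count() * 2 + 1))

# Threads per worker; handled by the threaded worker class below.
threads = int(os.environ.get("GUNICORN_THREADS", 2))
worker_class = "gthread"

bind = "0.0.0.0:5000"
timeout = 60
```

It could then be started with something like `gunicorn -c gunicorn.conf.py microservice:app`, with Kubernetes pod replicas scaling on top of the per-pod workers.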
Thanks @tszumowski, very useful info and comments. I think we are probably also viewing gunicorn as the likely addition.
I was wondering if you have any tests on how the REST API for `MODEL`-type deployments handles concurrency? I'm guessing this is the part of the code that handles it, so I would expect the Tornado coroutines to do something: https://github.com/SeldonIO/seldon-core/blob/master/python/seldon_core/model_microservice.py#L228-L262
However, when testing with `ab` to throw some concurrent traffic at the API, I see very linear scaling of the processing time, no matter whether the model is extremely simple (a single-layer model going from 100x1) or more complex (embeddings, RNNs, dot products). Example:
Note: this is currently while running both the Docker server and `ab` on the same machine, so I don't expect everything to fly off the moon, but I expected some concurrency on a 4-core machine with hyperthreading. The problem seems to be identical when running the server with the CLI `seldon-core-microservice Model REST`, although with less network overhead.

Concurrency 1
Concurrency 5
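(As an aside, a similar comparison can be scripted in Python instead of `ab`; the sketch below is purely illustrative, and the port, route, and payload shape are assumptions that may not match the microservice's actual REST contract.)

```python
# Rough client-side sketch: fire N requests at two concurrency levels and compare
# mean latency. The URL and payload shape are assumptions about a local service.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:5000/predict"           # assumed endpoint
PAYLOAD = {"data": {"ndarray": [[1.0] * 100]}}  # assumed request format

def timed_request(_):
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=30)
    return time.perf_counter() - start

for concurrency in (1, 5):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, range(50)))
    print(f"concurrency={concurrency}: mean latency {sum(latencies) / len(latencies):.3f}s")
```

If the server processes requests serially, the mean latency at concurrency 5 should come out at roughly five times the concurrency-1 figure, matching the linear scaling described above.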
Am I doing something wrong here, or is the REST API intended to scale at the pod level, not at the thread level?
Or are you intending for gRPC to be used in this case?
Sorry for flooding your issues! :)