Use production server by default #2047

Merged: 9 commits, Jul 7, 2020
1 change: 1 addition & 0 deletions doc/source/python/index.rst
@@ -17,6 +17,7 @@ You can use the following links to navigate the Python seldon-core module:
Wrap using S2I <python_wrapping_s2i.md>
Wrap using Docker <python_wrapping_docker.md>
Seldon Python Client <seldon_client.md>
Seldon Python Server <python_server.md>
Python API reference <api/modules>


98 changes: 18 additions & 80 deletions doc/source/python/python_component.md
@@ -317,45 +317,33 @@ class UserCustomException(Exception):

```

### Gunicorn (Alpha Feature)

To run your class under Gunicorn, set the environment variable `GUNICORN_WORKERS` to an integer value greater than 1.

```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: gunicorn
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: GUNICORN_WORKERS
            value: '4'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1
```

## Multi-value numpy arrays

By default, the `ndarray` field of the `data` payload is converted to a numpy array whose inner values are all coerced to a single type. If your model needs to accept arrays that mix value types, override the `predict_raw` method, which gives you access to the raw request, and build the numpy array yourself:

```python
import numpy as np

class Model:
    def predict_raw(self, request):
        data = request.get("data", {}).get("ndarray")
        if data:
            mult_types_array = np.array(data, dtype=object)

        # Handle other data types as required + your logic
```
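To see why `dtype=object` matters, compare the default coercion with the object array. This is a standalone sketch with made-up payload values:

```python
import numpy as np

# A payload mixing strings, ints, and floats, as it might arrive in
# request["data"]["ndarray"] (the values here are invented for illustration).
ndarray = [["a", 1, 2.5], ["b", 3, 4.5]]

# Default conversion coerces everything to a single dtype (here: strings).
coerced = np.array(ndarray)
print(coerced.dtype.kind)  # 'U' — all values became unicode strings

# With dtype=object, each element keeps its original Python type.
mixed = np.array(ndarray, dtype=object)
print(type(mixed[0][1]))  # <class 'int'>
```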

## Gunicorn and load

If the wrapped Python class is [served under Gunicorn](./python_server) then,
as part of the initialization of each Gunicorn worker, a `load` method will be
called on your class if one is defined.
You should use this method to load and initialise your model.
This is important for Tensorflow models, which need their session created in
each worker process.
The [Tensorflow MNIST example](../examples/deep_mnist.html) does this.

```python
import tensorflow as tf
import numpy as np
import os
Expand Down Expand Up @@ -383,56 +371,6 @@ class DeepMnist(object):
return predictions.astype(np.float64)
```

### Single-threaded Flask for REST (experimental)

To run your class single-threaded with Flask, set the environment variable `FLASK_SINGLE_THREADED` to 1. This sets the `threaded` parameter of the Flask app to `False`. It is not the optimal setup for most models, but it can be useful when your model cannot be made thread-safe, as with many GPU-based models that deadlock when accessed from multiple threads.

```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: flaskexample
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: FLASK_SINGLE_THREADED
            value: '1'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1
```


## Integer numbers

The `json` package in Python, parses numbers with no decimal part as integers.
168 changes: 168 additions & 0 deletions doc/source/python/python_server.md
@@ -0,0 +1,168 @@
# Seldon Python Server

To serve your component, Seldon's Python wrapper will use
[Gunicorn](https://gunicorn.org/) under the hood by default.
Gunicorn is a high-performance HTTP server for UNIX that lets you easily
scale your model across multiple worker processes and threads.

.. Note::
   Gunicorn will only handle the horizontal scaling of your model **within the
   same pod and container**.
   To learn more about how to scale your model across multiple pod replicas, see
   the :doc:`../graph/scaling` section of the docs.

## Workers

By default, Seldon will only use a **single worker process**.
However, it's possible to increase this number through the `GUNICORN_WORKERS`
environment variable.
This variable can be controlled directly through the `SeldonDeployment` CRD.

For example, to run your model under 4 workers, you could do:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gunicorn
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: GUNICORN_WORKERS
            value: '4'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1
```

## Threads

By default, Seldon will process your model's incoming requests using a pool of
**10 threads per worker process**.
You can increase this number through the `GUNICORN_THREADS` environment
variable.
This variable can be controlled directly through the `SeldonDeployment` CRD.

For example, to run your model with 5 threads per worker, you could do:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: gunicorn
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: GUNICORN_THREADS
            value: '5'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1
```
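The two variables above can be sketched in plain Python. The helper name and its behaviour here are assumptions for illustration (the actual wiring lives inside seldon-core's wrapper); the defaults of 1 worker and 10 threads match the values stated in this document:

```python
import os

# Hypothetical helper: derive Gunicorn settings from the environment
# variables documented above.
def gunicorn_options() -> dict:
    return {
        "workers": int(os.environ.get("GUNICORN_WORKERS", "1")),
        "threads": int(os.environ.get("GUNICORN_THREADS", "10")),
    }

os.environ["GUNICORN_WORKERS"] = "4"
os.environ["GUNICORN_THREADS"] = "5"
print(gunicorn_options())  # {'workers': 4, 'threads': 5}
```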

### Disable multithreading

In some cases, you may want to disable multithreading completely.
To serve your model within a single thread, set the environment variable
`FLASK_SINGLE_THREADED` to `1`.
This is not the optimal setup for most models, but it can be useful when your
model cannot be made thread-safe, as with many GPU-based models that deadlock
when accessed from multiple threads.


```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: flaskexample
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: FLASK_SINGLE_THREADED
            value: '1'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1
```
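Conceptually, the flag maps onto the boolean passed to Flask's `threaded` parameter via `app.run(threaded=...)`. The helper below is a hypothetical sketch, not seldon-core's actual wiring:

```python
import os

# Hypothetical helper: True means Flask may handle requests in multiple
# threads; False serves them one at a time in a single thread.
def flask_threaded() -> bool:
    return os.environ.get("FLASK_SINGLE_THREADED", "0") != "1"

os.environ["FLASK_SINGLE_THREADED"] = "1"
print(flask_threaded())  # False — single-threaded serving
```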

## Development server

While Gunicorn is recommended for production workloads, it's also possible to
use Flask's built-in development server.
To enable the development server, set the `SELDON_DEBUG` environment variable
to `1`.

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: flask-development-server
spec:
  name: worker
  predictors:
  - componentSpecs:
    - spec:
        containers:
        - image: seldonio/mock_classifier:1.0
          name: classifier
          env:
          - name: SELDON_DEBUG
            value: '1'
        terminationGracePeriodSeconds: 1
    graph:
      children: []
      endpoint:
        type: REST
      name: classifier
      type: MODEL
    labels:
      version: v1
    name: example
    replicas: 1
```
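The server choice reduces to a single environment check. The helper name below is hypothetical, for illustration only:

```python
import os

# Hypothetical helper: decide between the Flask development server and
# Gunicorn based on the SELDON_DEBUG variable documented above.
def use_dev_server() -> bool:
    return os.environ.get("SELDON_DEBUG", "0") == "1"

os.environ["SELDON_DEBUG"] = "1"
print(use_dev_server())  # True — run the Flask development server
```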
74 changes: 74 additions & 0 deletions python/seldon_core/app.py
@@ -0,0 +1,74 @@
import os
import logging

from typing import Dict, Union
from gunicorn.app.base import BaseApplication

logger = logging.getLogger(__name__)


def accesslog(log_level: str) -> Union[str, None]:
    """
    Enable / disable access log in Gunicorn depending on the log level.
    """

    if log_level in ["WARNING", "ERROR", "CRITICAL"]:
        return None

    return "-"


def threads(threads: int, single_threaded: bool) -> int:
    """
    Number of threads to run in each Gunicorn worker.
    """

    if single_threaded:
        return 1

    return threads
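# For example (illustrative values):
#
#   accesslog("ERROR")  -> None  (access log disabled for high log levels)
#   accesslog("INFO")   -> "-"   (Gunicorn logs access to stdout)
#   threads(10, False)  -> 10
#   threads(10, True)   -> 1     (single-threaded mode)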


class StandaloneApplication(BaseApplication):
    """
    Standalone Application to run a Flask app in Gunicorn.
    """

    def __init__(self, app, options: Dict = None):
        self.application = app
        # Default to an empty dict so load_config() can iterate safely
        self.options = options or {}
        super().__init__()

    def load_config(self):
        config = dict(
            [
                (key, value)
                for key, value in self.options.items()
                if key in self.cfg.settings and value is not None
            ]
        )
        for key, value in config.items():
            self.cfg.set(key.lower(), value)

    def load(self):
        return self.application


class UserModelApplication(StandaloneApplication):
    """
    Gunicorn application to run a Flask app in Gunicorn, loading the
    user's model first.
    """

    def __init__(self, app, user_object, options: Dict = None):
        self.user_object = user_object
        super().__init__(app, options)

    def load(self):
        logger.debug("LOADING APP %d", os.getpid())
        try:
            logger.debug("Calling user load method")
            self.user_object.load()
        except (NotImplementedError, AttributeError):
            logger.debug("No load method in user model")
        return self.application
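# --- Illustrative usage (not part of this module) ---
# A sketch of how these pieces could compose; the option keys below are
# standard Gunicorn settings, while `flask_app` and `user_model` are
# placeholders:
#
#   options = {
#       "bind": "0.0.0.0:9000",
#       "accesslog": accesslog("INFO"),   # "-" => log access to stdout
#       "threads": threads(10, False),    # thread pool size per worker
#       "workers": 4,
#   }
#   UserModelApplication(flask_app, user_model, options=options).run()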