Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(docs): improve HPA autoscaling docs #5953

Merged
merged 3 commits into from
Oct 4, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
170 changes: 142 additions & 28 deletions docs-gb/kubernetes/hpa-rps-autoscaling.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,22 @@

# HPA Autoscaling in single-model serving

Learn about how to jointly autoscale model and server replicas based on a metric of inference requests per second (RPS) using HPA, when there is a one-to-one correspondence between models and servers (single-model serving). This will require:

* Having a Seldon Core 2 install that publishes metrics to prometheus (default). In the following, we will assume that prometheus is already installed and configured in the `seldon-monitoring` namespace.
* Installing and configuring [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter), which allows prometheus queries on relevant metrics to be published as k8s custom metrics
* Configuring HPA manifests to scale Models and the corresponding Server replicas based on the custom metrics
Learn about how to jointly autoscale model and server replicas based on a metric of inference
requests per second (RPS) using HPA, when there is a one-to-one correspondence between models
and servers (single-model serving). This will require:

* Having a Seldon Core 2 install that publishes metrics to prometheus (default). In the
following, we will assume that prometheus is already installed and configured in the
`seldon-monitoring` namespace.
* Installing and configuring [Prometheus Adapter](https://github.com/kubernetes-sigs/prometheus-adapter),
which allows prometheus queries on relevant metrics to be published as k8s custom metrics
* Configuring HPA manifests to scale Models and the corresponding Server replicas based on the
custom metrics

### Installing and configuring the Prometheus Adapter

The role of the Prometheus Adapter is to expose queries on metrics in prometheus as k8s custom or external metrics. Those can then be accessed by HPA in order to take scaling decisions.
The role of the Prometheus Adapter is to expose queries on metrics in prometheus as k8s custom
or external metrics. Those can then be accessed by HPA in order to take scaling decisions.

To install through helm:

Expand All @@ -19,17 +26,29 @@ helm repo update
helm install --set prometheus.url='http://seldon-monitoring-prometheus' hpa-metrics prometheus-community/prometheus-adapter -n seldon-monitoring
```

In the commands above, we install `prometheus-adapter` as a helm release named `hpa-metrics` in the same namespace as our prometheus install, and point to its service URL (without the port).
In the commands above, we install `prometheus-adapter` as a helm release named `hpa-metrics` in
the same namespace as our prometheus install, and point to its service URL (without the port).

If running prometheus on a different port than the default 9090, you can also pass `--set prometheus.port=[custom_port]` You may inspect all the options available as helm values by running `helm show values prometheus-community/prometheus-adapter`
If running prometheus on a different port than the default 9090, you can also pass `--set
prometheus.port=[custom_port]` You may inspect all the options available as helm values by
running `helm show values prometheus-community/prometheus-adapter`

We now need to configure the adapter to look for the correct prometheus metrics and compute per-model RPS values. On install, the adapter has created a `ConfigMap` in the same namespace as itself, named `[helm_release_name]-prometheus-adapter`. In our case, it will be `hpa-metrics-prometheus-adapter`.
We now need to configure the adapter to look for the correct prometheus metrics and compute
per-model RPS values. On install, the adapter has created a `ConfigMap` in the same namespace as
itself, named `[helm_release_name]-prometheus-adapter`. In our case, it will be
`hpa-metrics-prometheus-adapter`.

We want to overwrite this ConfigMap with the content below (please change the name if your helm release has a different one). The manifest contains embedded documentation, highlighting how we match the `seldon_model_infer_total` metric in Prometheus, compute a rate via a `metricsQuery` and expose this to k8s as the `infer_rps` metric, on a per (model, namespace) basis.
We want to overwrite this ConfigMap with the content below (please change the name if your helm
release has a different one). The manifest contains embedded documentation, highlighting how we
match the `seldon_model_infer_total` metric in Prometheus, compute a rate via a `metricsQuery`
and expose this to k8s as the `infer_rps` metric, on a per (model, namespace) basis.

Other aggregations on per (server, namespace) and (pod, namespace) are also exposed and may be used in HPA, but we will focus on the (model, namespace) aggregation in the examples below.
Other aggregations on per (server, namespace) and (pod, namespace) are also exposed and may be
used in HPA, but we will focus on the (model, namespace) aggregation in the examples below.

You may want to modify some of the settings to match the prometheus query that you typically use for RPS metrics. For example, the `metricsQuery` below computes the RPS by calling [`rate()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) with a 1 minute window.
You may want to modify some of the settings to match the prometheus query that you typically use
for RPS metrics. For example, the `metricsQuery` below computes the RPS by calling [`rate()`]
(https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) with a 1 minute window.

````yaml
apiVersion: v1
Expand Down Expand Up @@ -125,7 +144,8 @@ data:
)
````

Apply the config, and restart the prometheus adapter deployment (this restart is required so that prometheus-adapter picks up the new config):
Apply the config, and restart the prometheus adapter deployment (this restart is required so
that prometheus-adapter picks up the new config):

```sh
# Apply prometheus adapter config
Expand All @@ -134,10 +154,15 @@ kubectl apply -f prometheus-adapter.config.yaml
kubectl rollout restart deployment hpa-metrics-prometheus-adapter -n seldon-monitoring
```

In order to test that the prometheus adapter config works and everything is set up correctly, you can issue raw kubectl requests against the custom metrics API, as described below.
In order to test that the prometheus adapter config works and everything is set up correctly,
you can issue raw kubectl requests against the custom metrics API, as described below.

{% hint style="info" %}
If no inference requests were issued towards any model in the Seldon install, the metrics configured above will not be available in prometheus, and thus will also not appear when checking via the commands below. Therefore, please first run some inference requests towards a sample model to ensure that the metrics are available — this is only required for the testing of the install.
If no inference requests were issued towards any model in the Seldon install, the metrics
configured above will not be available in prometheus, and thus will also not appear when
checking via the commands below. Therefore, please first run some inference requests towards a
sample model to ensure that the metrics are available — this is only required for the testing of
the install.
{% endhint %}

**Testing the prometheus-adapter install using the custom metrics API**
Expand Down Expand Up @@ -174,9 +199,13 @@ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/metrics/infer

### Configuring HPA manifests

For every (Model, Server) pair you want to autoscale, you need to apply 2 HPA manifests based on the same metric: one scaling the Model, the other the Server. The example below only works if the mapping between Models and Servers is 1-to-1 (i.e no multi-model serving).
For every (Model, Server) pair you want to autoscale, you need to apply 2 HPA manifests based on
the same metric: one scaling the Model, the other the Server. The example below only works if
the mapping between Models and Servers is 1-to-1 (i.e no multi-model serving).

Consider a model named `irisa0` with the following manifest. Please note we don’t set `minReplicas/maxReplicas` this is in order to disable the seldon-specific autoscaling so that it doesn’t interact with HPA.
Consider a model named `irisa0` with the following manifest. Please note we don’t set
`minReplicas/maxReplicas` this is in order to disable the seldon-specific autoscaling so that it
doesn’t interact with HPA.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
Expand All @@ -192,7 +221,8 @@ spec:
storageUri: gs://seldon-models/testing/iris1
```

Let’s scale this model when it is deployed on a server named `mlserver`, with a target RPS **per replica** of 3 RPS (higher RPS would trigger scale-up, lower would trigger scale-down):
Let’s scale this model when it is deployed on a server named `mlserver`, with a target RPS **per
replica** of 3 RPS (higher RPS would trigger scale-up, lower would trigger scale-down):

```yaml
apiVersion: autoscaling/v2
Expand Down Expand Up @@ -246,29 +276,111 @@ spec:
averageValue: 3
```

In the two HPA manifests above, the scaling metric is exactly the same, and uses the exact same parameters: this is to ensure that both the Models and the Servers are scaled up/down at approximately the same time. Small variations in the scale-up time are expected because each HPA samples the metrics independently, at regular intervals. If a Model gets scaled up slightly before its corresponding Server, the model is currently marked with the condition ModelReady "Status: False" with a "ScheduleFailed" message until new Server replicas become available. However, the existing replicas of that model remain available and will continue to serve inference load.
In the preceding HPA manifests, the scaling metric is exactly the same, and uses the exact same
parameters: this is to ensure that both the Models and the Servers are scaled up/down at
approximately the same time. Small variations in the scale-up time are expected because each HPA
samples the metrics independently, at regular intervals.

{% hint style="info" %}
If a Model gets scaled up slightly before its corresponding Server, the model is currently
marked with the condition ModelReady "Status: False" with a "ScheduleFailed" message until new
Server replicas become available. However, the existing replicas of that model remain available
and will continue to serve inference load.
{% endhint %}

In order to ensure similar scaling behaviour between Models and Servers, the number of
`minReplicas` and `maxReplicas`, as well as any other configured scaling policies should be kept
in sync across the HPA for the model and the server.

#### Details on custom metrics of type Object

The HPA manifests use metrics of type "Object" that fetch the data used in scaling
decisions by querying k8s metrics associated with a particular k8s object. The endpoints that
HPA uses for fetching those metrics are the same ones that were tested in the previous section
using `kubectl get --raw ...`. Because you have configured the Prometheus Adapter to expose those
k8s metrics based on queries to Prometheus, a mapping exists between the information contained
in the HPA Object metric definition and the actual query that is executed against Prometheus.
This section aims to give more details on how this mapping works.

In our example, the `metric.name:infer_rps` gets mapped to the `seldon_model_infer_total` metric
on the prometheus side, based on the configuration in the `name` section of the Prometheus
Adapter ConfigMap. The prometheus metric name is then used to fill in the `<<.Series>>` template
in the query (`metricsQuery` in the same ConfigMap).

Then, the information provided in the `describedObject` is used within the Prometheus query to
select the right aggregations of the metric. For the RPS metric used to scale the Model
(and the Server because of the 1-1 mapping), it makes sense to compute the aggregate RPS across
all the replicas of a given model, so the `describedObject` references a specific Model CR.

In order to ensure similar scaling behaviour between Models and Servers, the number of minReplicas and maxReplicas defined in the HPA, as well as other scaling policies configured should be kept in sync across the HPA for the model and the server.
However, in the general case, the `describedObject` does not need to be a Model. Any k8s object
listed in the `resources` section of the Prometheus Adapter ConfigMap may be used. The Prometheus
label associated with the object kind fills in the `<<.GroupBy>>` template, while the name gets
used as part of the `<<.LabelMatchers>>`. For example:

* If the described object is `{ kind: Namespace, name: seldon-mesh }`, then the Prometheus
query template configured in our example would be transformed into:

```
sum by (namespace) (
rate (
seldon_model_infer_total{namespace="seldon-mesh"}[1m]
)
)
```

* If the described object is not a namespace (for example, `{ kind: Pod, name: mlserver-0 }`)
then the query will be passed the label describing the object, alongside an additional label
identifying the namespace where the HPA manifest resides in.:

```
sum by (pod) (
rate (
seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[1m]
)
)
```

Please note that you **must** use a `target.type` of `AverageValue`. The value given in `averageValue` is the threshold RPS per replica, and the new (scaled) number of replicas is computed by HPA according to the following formula:
For the `target` of the Object metric you **must** use a `type` of `AverageValue`. The value
given in `averageValue` represents the per replica RPS scaling threshold of the `scaleTargetRef`
object (either a Model or a Server in our case), with the target number of replicas being
computed by HPA according to the following formula:

$$\texttt{targetReplicas} = \frac{\texttt{infer\_rps}}{\texttt{thresholdPerReplicaRPS}}$$

{% hint style="info" %}
Attempting other target types will not work under the current Seldon Core v2 setup, because they use the number of active Pods associated with the Model CR (i.e. the associated Server pods) in the `targetReplicas` computation. However, this also means that this set of pods becomes "owned" by the Model HPA. Once a pod is owned by a given HPA it is not available for other HPAs to use, so we would no longer be able to scale the Server CRs using HPA.
Attempting other target types does not work under the current Seldon Core 2 setup, because they
use the number of active Pods associated with the Model CR (i.e. the associated Server pods) in
the `targetReplicas` computation. However, this also means that this set of pods becomes "owned"
by the Model HPA. Once a pod is owned by a given HPA it is not available for other HPAs to use,
so we would no longer be able to scale the Server CRs using HPA.
{% endhint %}

**Advanced settings:**
#### Advanced settings

* Filtering metrics by other labels on the prometheus metric
* Filtering metrics by additional labels on the prometheus metric

The prometheus metric from which the model RPS is computed has the following labels:

```yaml
seldon_model_infer_total{code="200", container="agent", endpoint="metrics", instance="10.244.0.39:9006", job="seldon-mesh/agent", method_type="rest", model="irisa0", model_internal="irisa0_1", namespace="seldon-mesh", pod="mlserver-0", server="mlserver", server_replica="0"}
seldon_model_infer_total{
code="200", 
container="agent", 
endpoint="metrics", 
instance="10.244.0.39:9006", 
job="seldon-mesh/agent", 
method_type="rest", 
model="irisa0", 
model_internal="irisa0_1", 
namespace="seldon-mesh", 
pod="mlserver-0", 
server="mlserver", 
server_replica="0"
}
```

If we wanted the scaling metric to be computed based on inferences with a particular value for those metrics, we can add this in the HPA metric config, as in the example below (targeting `method_type="rest"`):
If you want the scaling metric to be computed based on inferences with a particular value
for any of those labels, you can add this in the HPA metric config, as in the example
(targeting `method_type="rest"`):

```yaml
metrics:
Expand All @@ -287,6 +399,8 @@ Attempting other target types will not work under the current Seldon Core v2 set
type: AverageValue
averageValue: "3"
```
* Customise scale-up / scale-down rate & properties by using scaling policies as described in the [HPA scaling policies docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior)
* Customise scale-up / scale-down rate & properties by using scaling policies as described in
the [HPA scaling policies docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior)

For more resources, please consult the [HPA docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) and the [HPA walkthrough](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/)
* For more resources, please consult the [HPA docs](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
and the [HPA walkthrough](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/)
Loading