
K8S: Multiple HPAs cannot scale on exact same External Metric query within the same cluster #6388

Closed
jodylent opened this issue Sep 16, 2020 · 2 comments

@jodylent

Output of the info page (if this is a bug)

N/A -- this is explicitly written into the Agent source code

Describe what you expected:

BACKGROUND

  1. At $employer we run various services distributed evenly across 2 or more AWS Availability Zones per region. We generally prefer to scale these evenly across AZs, due to edge and internal load balancing configurations.
  2. Kubernetes 1.18 supports pod topology spread constraints to do this, but AWS EKS support for 1.18 isn't available yet (docs); see the sketch after this list.
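
For reference, a minimal sketch of what those pod topology spread constraints might look like on Kubernetes 1.18+ (the Deployment name, labels, and image below are hypothetical placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-service           # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: test-service
  template:
    metadata:
      labels:
        app: test-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # spread pods evenly across AZs
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: test-service
      containers:
      - name: app
        image: nginx           # placeholder image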

When testing various balancing strategies, we attempted to create N (ReplicaSets + HPAs) for a test service, one per AZ. Each HPA would scale on the same metric query, with the idea that $test_service would thus scale evenly across all AZs within an AWS region.

When testing with a single (ReplicaSet + HPA), scaling worked perfectly, just as expected:

$ kubectl describe hpa test-2a

<... trimmed output ...>
Reference:                                                    ReplicaSet/test-2a
Metrics:                                                      ( current / target )
  "service_state.weight_utilization" (target average value):  23m / 1m
Min replicas:                                                 1
Max replicas:                                                 5
ReplicaSet pods:                                              5 current / 5 desired

Describe what happened:

When testing with 2 or more (ReplicaSet + HPA), scaling failed to occur:

$ kubectl describe hpa test-2a

<... trimmed output ...>
Reference:                                                    ReplicaSet/test-2a
Metrics:                                                      ( current / target )
  "service_state.weight_utilization" (target average value):  <unknown> / 2
Min replicas:                                                 1
Max replicas:                                                 5
ReplicaSet pods:                                              1 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
Events:
  Type     Reason                                          Age                   From                       Message
  ----     ------                                          ----                  ----                       -------
  Normal   Autoscaler is now handled by the Cluster-Agent  29m                   datadog-cluster-agent      
  Warning  FailedComputeMetricsReplicas                    26m (x12 over 29m)    horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get service_state.weight_utilization external metric: unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
  Warning  FailedGetExternalMetric                         13m (x61 over 29m)    horizontal-pod-autoscaler  unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
  Warning  FailedGetExternalMetric                         3m46s (x33 over 11m)  horizontal-pod-autoscaler  unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API

When we dug further, we found the following cause in our logs via kubectl logs datadog-cluster-agent:

Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid

Example entries:

2020-09-16 18:32:47 UTC | CLUSTER | WARN | (pkg/util/kubernetes/autoscalers/datadogexternal.go:102 in queryDatadogExternal) | Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid

2020-09-16 18:33:16 UTC | CLUSTER | WARN | (pkg/util/kubernetes/autoscalers/datadogexternal.go:102 in queryDatadogExternal) | Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid

< many more of the same>

This led us to read the source, and sure enough, the queryDatadogExternal func expects each query within a single cluster to be unique:

// https://github.com/DataDog/datadog-agent/blob/master/pkg/util/kubernetes/autoscalers/datadogexternal.go#L65
// queryDatadogExternal converts the metric name and labels from the Ref format into a Datadog metric.
// It returns the last value for a bucket of 5 minutes,
func (p *Processor) queryDatadogExternal(ddQueries []string, bucketSize int64) (map[string]Point, error) {
    ...


// https://github.com/DataDog/datadog-agent/blob/master/pkg/util/kubernetes/autoscalers/datadogexternal.go#L98

        // Check if we already have a Serie result for this query. We expect query to result in a single Serie
        // Otherwise we are not able to determine which value we should take for Autoscaling
        if existingPoint, found := processedMetrics[ddQueries[queryIndex]]; found {
            if existingPoint.Valid {
                log.Warnf("Multiple Series found for query: %s. Please change your query to return a single Serie. Results will be flagged as invalid", ddQueries[queryIndex])
                existingPoint.Valid = false
                existingPoint.Timestamp = time.Now().Unix()
                processedMetrics[ddQueries[queryIndex]] = existingPoint
            }
            continue
        }
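
To illustrate the collision (a sketch based on the manifests and log lines in this issue, not code from the Agent): two HPAs carrying identical external metric specs are rendered into the same Datadog query string, which is the map key checked above.

# Identical external metric spec in two different HPAs (see the full manifest below) ...
metrics:
- type: External
  external:
    metricName: service_state.weight_utilization
    metricSelector:
      matchLabels:
        environment: preprod
        service: testservice
# ... is rendered into the same query for both, as seen in the Cluster Agent logs:
#   avg:service_state.weight_utilization{environment:preprod,service:testservice}.rollup(30)
# so the second Serie trips the processedMetrics check and both results are flagged invalid.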

Implications

  1. A single cluster can have at most one HorizontalPodAutoscaler configured with a given exact metric query
  2. Fine-grained control over the AZ distribution of replicas requires emitting AZ-specific tags on Kubernetes 1.17 and below

Steps to reproduce the issue:

  1. Create two or more ReplicaSets or Deployments (any scalable Controller will do)
  2. Create an HPA manifest for each, and apply. As an example:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: test-us-west-2a  # CHANGEME for us-west-2b or any other AZ in your second manifest
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: ReplicaSet
    name: test-us-west-2a  # CHANGEME
  metrics:
  - type: External
    external:
      # I've chosen an existing preproduction metric with values between 0.0 and 1.0
      # to prove that the HPA scales on arbitrary DD metrics
      metricName: service_state.weight_utilization
      metricSelector:
        # Label name/value corresponds to DD Tag name/value
        matchLabels:
          environment: preprod
          service: <an existing service>    # CHANGEME
          # The plot thickens: if you add an AZ-specific label, the queries become unique,
          # generating a single series, and thus "working"
          # availability-zone: us-west-2a
      targetAverageValue: 0.001  # set to max out replicas, exact value unneeded to reproduce

Additional environment details (Operating System, Cloud provider, etc):

N/A

@vboulineau
Contributor

Hi @jodylent,

We've fixed this with #6412.
It will be released with the upcoming Cluster Agent 1.9.

It also works properly if you use the DatadogMetric feature:
https://docs.datadoghq.com/agent/cluster_agent/external_metrics/#autoscaling-with-custom-queries-using-datadogmetric-crd-cluster-agent-v1-7-0
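
For completeness, a minimal sketch of that approach, assuming the query and names used in this issue (check the linked docs for the authoritative format):

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: test-us-west-2a
  namespace: default
spec:
  query: avg:service_state.weight_utilization{environment:preprod,service:testservice}.rollup(30)

Each HPA can then reference its own DatadogMetric object as an External metric named datadogmetric@default:test-us-west-2a, rather than sharing the raw query.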

@jodylent
Author

Fixed by #6412
