
K8S: Multiple HPAs cannot scale on exact same External Metric query within the same cluster #6388

Closed
jodylent opened this issue Sep 16, 2020 · 2 comments

@jodylent

Output of the info page (if this is a bug)

N/A -- this is explicitly written into the Agent source code

Describe what you expected:

BACKGROUND

  1. At $employer we run various services distributed evenly across 2 or more AWS Availability Zones per region. We generally prefer to scale these evenly across AZs, due to edge and internal load balancing configurations.
  2. Kubernetes 1.18 supports pod topology spread constraints to do this, but AWS EKS support for 1.18 isn't available yet (docs); see the sketch after this list.
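
For reference, a minimal sketch of what those pod topology spread constraints might look like on Kubernetes 1.18+ (the Deployment name, labels, and image below are hypothetical placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-service           # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: test-service
  template:
    metadata:
      labels:
        app: test-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # spread pods evenly across AZs
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: test-service
      containers:
      - name: app
        image: nginx           # placeholder image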

When testing various balancing strategies, we attempted to create N (ReplicaSets + HPAs) for a test service, one per AZ. Each HPA would scale on the same metric query, with the idea that $test_service would thus scale evenly across all AZs within an AWS region.

When testing with a single (ReplicaSet + HPA), scaling worked perfectly, just as expected:

$ kubectl describe hpa test-2a

<... trimmed output ...>
Reference:                                                    ReplicaSet/test-2a
Metrics:                                                      ( current / target )
  "service_state.weight_utilization" (target average value):  23m / 1m
Min replicas:                                                 1
Max replicas:                                                 5
ReplicaSet pods:                                              5 current / 5 desired

Describe what happened:

When testing with 2 or more (ReplicaSet + HPA), scaling failed to occur:

$ kubectl describe hpa test-2a

<... trimmed output ...>
Reference:                                                    ReplicaSet/test-2a
Metrics:                                                      ( current / target )
  "service_state.weight_utilization" (target average value):  <unknown> / 2
Min replicas:                                                 1
Max replicas:                                                 5
ReplicaSet pods:                                              1 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
Events:
  Type     Reason                                          Age                   From                       Message
  ----     ------                                          ----                  ----                       -------
  Normal   Autoscaler is now handled by the Cluster-Agent  29m                   datadog-cluster-agent      
  Warning  FailedComputeMetricsReplicas                    26m (x12 over 29m)    horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get service_state.weight_utilization external metric: unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
  Warning  FailedGetExternalMetric                         13m (x61 over 29m)    horizontal-pod-autoscaler  unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
  Warning  FailedGetExternalMetric                         3m46s (x33 over 11m)  horizontal-pod-autoscaler  unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API

When we dug further, we found the following cause in our logs via kubectl logs datadog-cluster-agent:

Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid

Example entries:

2020-09-16 18:32:47 UTC | CLUSTER | WARN | (pkg/util/kubernetes/autoscalers/datadogexternal.go:102 in queryDatadogExternal) | Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid

2020-09-16 18:33:16 UTC | CLUSTER | WARN | (pkg/util/kubernetes/autoscalers/datadogexternal.go:102 in queryDatadogExternal) | Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid

< many more of the same>

This led us to read the source, and sure enough, the queryDatadogExternal func expects each query within a single cluster to be unique:

// https://github.com/DataDog/datadog-agent/blob/master/pkg/util/kubernetes/autoscalers/datadogexternal.go#L65
// queryDatadogExternal converts the metric name and labels from the Ref format into a Datadog metric.
// It returns the last value for a bucket of 5 minutes,
func (p *Processor) queryDatadogExternal(ddQueries []string, bucketSize int64) (map[string]Point, error) {
    ...


// https://github.com/DataDog/datadog-agent/blob/master/pkg/util/kubernetes/autoscalers/datadogexternal.go#L98

        // Check if we already have a Serie result for this query. We expect query to result in a single Serie
        // Otherwise we are not able to determine which value we should take for Autoscaling
        if existingPoint, found := processedMetrics[ddQueries[queryIndex]]; found {
            if existingPoint.Valid {
                log.Warnf("Multiple Series found for query: %s. Please change your query to return a single Serie. Results will be flagged as invalid", ddQueries[queryIndex])
                existingPoint.Valid = false
                existingPoint.Timestamp = time.Now().Unix()
                processedMetrics[ddQueries[queryIndex]] = existingPoint
            }
            continue
        }
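
To illustrate the collision (a sketch based on the manifests and log lines in this issue, not code from the Agent): two HPAs carrying identical external metric specs are rendered into the same Datadog query string, which is the map key checked above.

# Identical external metric spec in two different HPAs (see the full manifest below) ...
metrics:
- type: External
  external:
    metricName: service_state.weight_utilization
    metricSelector:
      matchLabels:
        environment: preprod
        service: testservice
# ... is rendered into the same query for both, as seen in the Cluster Agent logs:
#   avg:service_state.weight_utilization{environment:preprod,service:testservice}.rollup(30)
# so the second Serie trips the processedMetrics check and both results are flagged invalid.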

Implications

  1. A single cluster can have at most one HorizontalPodAutoscaler configured with a given exact metric query
  2. Fine-grained control over the AZ distribution of replicas requires emitting AZ-specific tags on Kubernetes 1.17 and below

Steps to reproduce the issue:

  1. Create two or more ReplicaSets or Deployments (any scalable Controller will do)
  2. Create an HPA manifest for each, and apply. As an example:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: test-us-west-2a  # CHANGEME for us-west-2b or any other AZ in your second manifest
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: ReplicaSet
    name: test-us-west-2a  # CHANGEME
  metrics:
  - type: External
    external:
      # I've chosen an existing preproduction metric with values between 0.0 and 1.0
      # to prove that the HPA scales on arbitrary DD metrics
      metricName: service_state.weight_utilization
      metricSelector:
        # Label name/value corresponds to DD Tag name/value
        matchLabels:
          environment: preprod
          service: <an existing service>    # CHANGEME
          # The plot thickens: if you add an AZ-specific label, the queries become unique,
          # generating a single series, and thus "working"
          # availability-zone: us-west-2a
      targetAverageValue: 0.001  # set to max out replicas, exact value unneeded to reproduce

Additional environment details (Operating System, Cloud provider, etc):

N/A

@vboulineau
Contributor

Hi @jodylent,

We've fixed this with #6412.
It will be released with the upcoming Cluster Agent 1.9.

It also works properly if you use the DatadogMetric feature:
https://docs.datadoghq.com/agent/cluster_agent/external_metrics/#autoscaling-with-custom-queries-using-datadogmetric-crd-cluster-agent-v1-7-0
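
For completeness, a minimal sketch of that approach, assuming the query and names used in this issue (check the linked docs for the authoritative format):

apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: test-us-west-2a
  namespace: default
spec:
  query: avg:service_state.weight_utilization{environment:preprod,service:testservice}.rollup(30)

Each HPA can then reference its own DatadogMetric object as an External metric named datadogmetric@default:test-us-west-2a, rather than sharing the raw query.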

@jodylent
Author

Fixed by #6412
