Output of the info page (if this is a bug)
N/A -- this is explicitly written into the Agent source code
Describe what you expected:
BACKGROUND
At $employer we run various services distributed evenly across 2 or more AWS Availability Zones per region. We generally prefer to scale these evenly across AZs, due to edge and internal load balancing configurations.
Kubernetes 1.18 supports pod topology spread constraints for this, but AWS EKS support for 1.18 isn't available yet (docs).
While testing various balancing strategies, we attempted to create N (ReplicaSet + HPA) pairs for a test service, one per AZ. Each HPA would scale on the same metric query, the idea being that $test_service would thus scale evenly across all AZs within an AWS Region.
When testing with a single (ReplicaSet + HPA) pair, scaling worked perfectly, just as expected:
$ kubectl describe hpa test-2a
<... trimmed output ...>
Reference: ReplicaSet/test-2a
Metrics: ( current / target )
"service_state.weight_utilization" (target average value): 23m / 1m
Min replicas: 1
Max replicas: 5
ReplicaSet pods: 5 current / 5 desired
Describe what happened:
When testing with 2 or more (ReplicaSet + HPA) pairs, scaling failed to occur:
$ kubectl describe hpa test-2a
<... trimmed output ...>
Reference: ReplicaSet/test-2a
Metrics: ( current / target )
"service_state.weight_utilization" (target average value): <unknown> / 2
Min replicas: 1
Max replicas: 5
ReplicaSet pods: 1 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetExternalMetric the HPA was unable to compute the replica count: unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Autoscaler is now handled by the Cluster-Agent 29m datadog-cluster-agent
Warning FailedComputeMetricsReplicas 26m (x12 over 29m) horizontal-pod-autoscaler invalid metrics (1 invalid out of 1), first error is: failed to get service_state.weight_utilization external metric: unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
Warning FailedGetExternalMetric 13m (x61 over 29m) horizontal-pod-autoscaler unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
Warning FailedGetExternalMetric 3m46s (x33 over 11m) horizontal-pod-autoscaler unable to get external metric default/service_state.weight_utilization/&LabelSelector{MatchLabels:map[string]string{environment: preprod,service: testservice,},MatchExpressions:[]LabelSelectorRequirement{},}: no metrics returned from external metrics API
When we dug further, we found the following cause in our logs via kubectl logs datadog-cluster-agent:
Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid
Example entries:
2020-09-16 18:32:47 UTC | CLUSTER | WARN | (pkg/util/kubernetes/autoscalers/datadogexternal.go:102 in queryDatadogExternal) | Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid
2020-09-16 18:33:16 UTC | CLUSTER | WARN | (pkg/util/kubernetes/autoscalers/datadogexternal.go:102 in queryDatadogExternal) | Multiple Series found for query: avg:service_state.weight_utilization{environment:preprod,service:testservice}}.rollup(30). Please change your query to return a single Serie. Results will be flagged as invalid
<... many more of the same ...>
This led us to read the source, and sure enough, the queryDatadogExternal func expects each query within a single cluster to be unique:
// https://github.com/DataDog/datadog-agent/blob/master/pkg/util/kubernetes/autoscalers/datadogexternal.go#L65
// queryDatadogExternal converts the metric name and labels from the Ref format into a Datadog metric.
// It returns the last value for a bucket of 5 minutes,
func (p *Processor) queryDatadogExternal(ddQueries []string, bucketSize int64) (map[string]Point, error) {
...
// https://github.com/DataDog/datadog-agent/blob/master/pkg/util/kubernetes/autoscalers/datadogexternal.go#L98
	// Check if we already have a Serie result for this query. We expect query to result in a single Serie
	// Otherwise we are not able to determine which value we should take for Autoscaling
	if existingPoint, found := processedMetrics[ddQueries[queryIndex]]; found {
		if existingPoint.Valid {
			log.Warnf("Multiple Series found for query: %s. Please change your query to return a single Serie. Results will be flagged as invalid", ddQueries[queryIndex])
			existingPoint.Valid = false
			existingPoint.Timestamp = time.Now().Unix()
			processedMetrics[ddQueries[queryIndex]] = existingPoint
		}
		continue
	}
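To make that behaviour concrete, here is a minimal, standalone Go sketch of the same dedup logic (the Point type and processSeries helper are illustrative stand-ins, not the agent's real definitions): when two HPAs produce the identical Datadog query string, the second series flags the shared entry as invalid, which surfaces as the "no metrics returned from external metrics API" errors shown above.

package main

import (
	"fmt"
	"time"
)

// Point is a simplified stand-in for the agent's internal metric point.
type Point struct {
	Value     float64
	Timestamp int64
	Valid     bool
}

// processSeries mimics the behaviour read above: each Datadog query string
// may map to exactly one series. If a second series arrives for the same
// query string, the stored point is flagged invalid.
func processSeries(queriesBySeries []string) map[string]Point {
	processedMetrics := make(map[string]Point)
	for _, query := range queriesBySeries {
		if existingPoint, found := processedMetrics[query]; found {
			if existingPoint.Valid {
				fmt.Printf("Multiple Series found for query: %s. Flagging as invalid\n", query)
				existingPoint.Valid = false
				existingPoint.Timestamp = time.Now().Unix()
				processedMetrics[query] = existingPoint
			}
			continue
		}
		processedMetrics[query] = Point{Value: 0.023, Timestamp: time.Now().Unix(), Valid: true}
	}
	return processedMetrics
}

func main() {
	// Two HPAs (e.g. test-2a and test-2b) with identical metricSelectors
	// produce the same query string, so the agent sees two series for one query.
	sameQuery := "avg:service_state.weight_utilization{environment:preprod,service:testservice}.rollup(30)"
	results := processSeries([]string{sameQuery, sameQuery})
	fmt.Printf("valid=%v\n", results[sameQuery].Valid) // valid=false -> HPA gets no metric
}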
Implications
A single cluster may have EXACTLY one HorizontalPodAutoscaler configured with a given exact metric query
Fine-grained control over AZ distribution of replicas therefore requires emitting AZ-specific tags on Kubernetes 1.17 and below (see the sketch below)
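As the comment in the repro manifest below also hints, the practical workaround is to make each HPA's query unique, e.g. by adding an AZ-specific tag to its metricSelector. A small illustrative sketch (buildQuery and the exact tag ordering are assumptions, not the agent's real query builder) showing that the two HPAs then map to distinct query strings, so each gets exactly one series:

package main

import "fmt"

// buildQuery mirrors the shape of the queries seen in the cluster-agent logs
// above; it is an illustrative helper only.
func buildQuery(env, service, az string) string {
	return fmt.Sprintf("avg:service_state.weight_utilization{environment:%s,service:%s,availability-zone:%s}.rollup(30)",
		env, service, az)
}

func main() {
	// One HPA per AZ, each with an AZ-specific label in its metricSelector:
	// the query strings differ, so no "Multiple Series" invalidation occurs.
	fmt.Println("test-us-west-2a ->", buildQuery("preprod", "testservice", "us-west-2a"))
	fmt.Println("test-us-west-2b ->", buildQuery("preprod", "testservice", "us-west-2b"))
}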
Steps to reproduce the issue:
Create two or more ReplicaSets or Deployments (any scalable Controller will do)
Create an HPA manifest for each, and apply. As an example:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: test-us-west-2a # CHANGEME for us-west-2b or any other AZ in your second manifest
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: ReplicaSet
    name: test-us-west-2a # CHANGEME
  metrics:
  - type: External
    external:
      # I've chosen an existing preproduction metric with values between 0.0 - 1.0
      # to prove that HPA scales on arbitrary DD metrics
      metricName: service_state.weight_utilization
      metricSelector:
        # Label name/value corresponds to DD Tag name/value
        matchLabels:
          environment: preprod
          service: <an existing service> # CHANGEME
          # The plot thickens: if you add an AZ-specific label, the queries become unique,
          # generating a single series, and thus "working"
          # availability-zone: us-west-2a
      targetAverageValue: 0.001 # set to max out replicas, exact value unneeded to reproduce
Additional environment details (Operating System, Cloud provider, etc):
N/A