
[Datadog scaler] Scaler returns fillValue if the last data point of query is null #3906

Closed
dogzzdogzz opened this issue Nov 23, 2022 · 8 comments · Fixed by #3954 or kedacore/keda-docs#1002
Labels
bug Something isn't working

Comments

@dogzzdogzz
Contributor


I reported a similar issue (3448) before; the root cause there was that a null latest data point caused the exception. Recently I finally had time to test the query again. This time I made sure all data points have data and that both queries return the same number of data points within the period.

Query 1: sum:trace.express.request.hits{service:foo,env:bar}.as_rate(). KEDA can get the metrics of this query without any problem.

# curl result of query 1
{
  "status": "ok",
  "resp_version": 1,
  "series": [
    {
      "end": 1669106969000,
      "attributes": {},
      "metric": "trace.express.request.hits",
      "interval": 10,
      "tag_set": [],
      "start": 1669106910000,
      "length": 6,
      "query_index": 0,
      "aggr": "sum",
      "scope": "env:bar,service:foo",
      "pointlist": [
        [
          1669106910000,
          417.3
        ],
        [
          1669106920000,
          465.4
        ],
        [
          1669106930000,
          447.1
        ],
        [
          1669106940000,
          440.9
        ],
        [
          1669106950000,
          440.8
        ],
        [
          1669106960000,
          748.5000000000001
        ]
      ],
      "expression": "sum:trace.express.request.hits{env:bar,service:foo}.as_rate()",
      "unit": [
        {
          "family": "cache",
          "scale_factor": 1,
          "name": "hit",
          "short_name": null,
          "plural": "hits",
          "id": 39
        },
        {
          "family": "time",
          "scale_factor": 1,
          "name": "second",
          "short_name": "s",
          "plural": "seconds",
          "id": 11
        }
      ],
      "display_name": "trace.express.request.hits"
    }
  ],
  "to_date": 1669106966000,
  "query": "sum:trace.express.request.hits{service:foo,env:bar}.as_rate()",
  "message": "",
  "res_type": "time_series",
  "times": [],
  "from_date": 1669106906000,
  "group_by": [],
  "values": []
}

Query 2: avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10). KEDA can get the metrics of this query without any problem.

# curl result of query 2
{
  "status": "ok",
  "resp_version": 1,
  "series": [
    {
      "end": 1669106969000,
      "attributes": {},
      "metric": "kubernetes.cpu.requests",
      "interval": 10,
      "tag_set": [],
      "start": 1669106910000,
      "length": 6,
      "query_index": 0,
      "aggr": "avg",
      "scope": "env:bar,service:foo",
      "pointlist": [
        [
          1669106910000,
          0.11999999731779099
        ],
        [
          1669106920000,
          0.11999999731779099
        ],
        [
          1669106930000,
          0.11999999731779099
        ],
        [
          1669106940000,
          0.11999999731779099
        ],
        [
          1669106950000,
          0.11999999731779099
        ],
        [
          1669106960000,
          0.11999999731779099
        ]
      ],
      "expression": "avg:kubernetes.cpu.requests{env:bar,service:foo}.rollup(10)",
      "unit": [
        {
          "family": "cpu",
          "scale_factor": 1,
          "name": "core",
          "short_name": null,
          "plural": "cores",
          "id": 31
        },
        null
      ],
      "display_name": "kubernetes.cpu.requests"
    }
  ],
  "to_date": 1669106966000,
  "query": "avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)",
  "message": "",
  "res_type": "time_series",
  "times": [],
  "from_date": 1669106906000,
  "group_by": [],
  "values": []
}

If I combine the above two queries into "query 1 / query 2":
Query 3: sum:trace.express.request.hits{service:foo,env:bar}.as_rate()/avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)
The curl result looks OK, but KEDA fails to get the metrics.

# curl result of query 3
{
  "status": "ok",
  "resp_version": 1,
  "series": [
    {
      "end": 1669106969000,
      "attributes": {},
      "metric": "(trace.express.request.hits / kubernetes.cpu.requests)",
      "interval": 10,
      "tag_set": [],
      "start": 1669106910000,
      "length": 6,
      "query_index": 0,
      "aggr": "sum",
      "scope": "env:bar,service:foo",
      "pointlist": [
        [
          1669106910000,
          3477.500077728184
        ],
        [
          1669106920000,
          3878.3334200208405
        ],
        [
          1669106930000,
          3725.8334166122
        ],
        [
          1669106940000,
          3674.166748790693
        ],
        [
          1669106950000,
          3673.3334154387335
        ],
        [
          1669106960000,
          6237.500139418993
        ]
      ],
      "expression": "(sum:trace.express.request.hits{env:bar,service:foo}.as_rate() / avg:kubernetes.cpu.requests{env:bar,service:foo}.rollup(10))",
      "unit": null,
      "display_name": "(trace.express.request.hits / kubernetes.cpu.requests)"
    }
  ],
  "to_date": 1669106966000,
  "query": "sum:trace.express.request.hits{service:foo,env:bar}.as_rate()/avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)",
  "message": "",
  "res_type": "time_series",
  "times": [],
  "from_date": 1669106906000,
  "group_by": [],
  "values": []
}

HPA events

  Warning  FailedGetExternalMetric       31m (x8 over 23h)  horizontal-pod-autoscaler  unable to get external metric development/s0-datadog-sum-trace-express-request-hits/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: foo,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s0-datadog-sum-trace-express-request-hits
  Warning  FailedComputeMetricsReplicas  31m (x8 over 23h)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get s0-datadog-sum-trace-express-request-hits external metric: unable to get external metric development/s0-datadog-sum-trace-express-request-hits/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: foo,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s0-datadog-sum-trace-express-request-hits

If I remove as_rate() from Query 3:
Query 4: sum:trace.express.request.hits{service:foo,env:bar}/avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)
KEDA can get the metrics without any problem

# curl result of query 4
{
  "status": "ok",
  "resp_version": 1,
  "series": [
    {
      "end": 1669106960000,
      "attributes": {},
      "metric": "(trace.express.request.hits / kubernetes.cpu.requests)",
      "interval": 1,
      "tag_set": [],
      "start": 1669106910000,
      "length": 6,
      "query_index": 0,
      "aggr": "sum",
      "scope": "service:foo,env:bar",
      "pointlist": [
        [
          1669106910000,
          64426.34634984409
        ],
        [
          1669106920000,
          66898.41915361531
        ],
        [
          1669106930000,
          65685.75394920415
        ],
        [
          1669106940000,
          66132.75977946246
        ],
        [
          1669106950000,
          66287.59952733
        ],
        [
          1669106960000,
          66472.42117646785
        ]
      ],
      "expression": "(sum:trace.express.request.hits{service:foo,env:bar} / avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10))",
      "unit": null,
      "display_name": "(trace.express.request.hits / kubernetes.cpu.requests)"
    }
  ],
  "to_date": 1669106966000,
  "query": "sum:trace.express.request.hits{service:foo,env:bar}/avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)",
  "message": "",
  "res_type": "time_series",
  "times": [],
  "from_date": 1669106906000,
  "group_by": [],
  "values": []
}

Expected Behavior

KEDA should be able to get the metrics without any problem as long as the same query works with curl.

Actual Behavior

Explained above

Steps to Reproduce the Problem

Explained above

Logs from KEDA operator

E1123 09:05:17.876484       1 datadog_scaler.go:318] keda_metrics_adapter/datadog_scaler "msg"="error getting metrics from Datadog" "error"="error when retrieving Datadog metrics: Get \"https://api.datadoghq.com/api/v1/query?from=1669194254&query=sum%3Atrace.express.request.hits%7Bservice%3Afoo%2Cenv%3Abar%7D.as_rate%28%29%2Favg%3Akubernetes.cpu.requests%7Bservice%3Afoo%2Cenv%3Abar%7D.rollup%2810%29&to=1669194314\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" "name"="foo" "namespace"="bar" "type"="ScaledObject"

KEDA Version

2.8.1

Kubernetes Version

< 1.23

Platform

Amazon Web Services

Scaler Details

Datadog

Anything else?

No response

@dogzzdogzz dogzzdogzz added the bug Something isn't working label Nov 23, 2022
@dogzzdogzz dogzzdogzz changed the title [Datadog scaler]. Failed to get the metrics for some specific datadog query [Datadog scaler] Failed to get the metrics for some specific datadog query Nov 23, 2022
@JorTurFer
Member

JorTurFer commented Nov 24, 2022

Hi @dogzzdogzz ,
I have tried your JSONs injecting them into the datadog SDK and I can't reproduce the issue, queries 3 and 4 work without any problem. Are you sure that there isn't any strange situation like +Inf or -Inf, or IDK, but using those values I can't reproduce the issue. Is there any new logs related with this?

@JorTurFer JorTurFer reopened this Nov 24, 2022
@JorTurFer
Member

BTW, I closed the issue by mistake and have reopened it.

@dogzzdogzz
Contributor Author

@JorTurFer Hmm... it's strange that you can't reproduce it; it's happening on all of my clusters. To clarify the reproduction, I just want to double-check: do you have the metric trace.express.request.hits in your Datadog account, and does it have data ingested from the Datadog agent?

And could you kindly check whether anything in my ScaledObject manifests could be causing this issue?

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    meta.helm.sh/release-name: foo
    meta.helm.sh/release-namespace: bar
  labels:
    app: foo
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: foo
    helm.toolkit.fluxcd.io/namespace: bar
    scaledobject.keda.sh/name: query-1
  name: query-1
  namespace: bar
spec:
  maxReplicaCount: 1
  minReplicaCount: 1
  scaleTargetRef:
    name: foo
  triggers:
  - authenticationRef:
      kind: ClusterTriggerAuthentication
      name: keda-trigger-auth-datadog-secret
    metadata:
      age: "60"
      metricUnavailableValue: "0"
      query: 'sum:trace.express.request.hits{service:foo,env:bar}.as_rate()'
      queryValue: "250"
    metricType: AverageValue
    type: datadog
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  labels:
    scaledobject.keda.sh/name: query-2
  name: query-2
  namespace: bar
spec:
  maxReplicaCount: 1
  minReplicaCount: 1
  scaleTargetRef:
    name: foo
  triggers:
  - authenticationRef:
      kind: ClusterTriggerAuthentication
      name: keda-trigger-auth-datadog-secret
    metadata:
      age: "60"
      metricUnavailableValue: "0"
      query: 'avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)'
      queryValue: "250"
    metricType: AverageValue
    type: datadog
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  labels:
    scaledobject.keda.sh/name: query-3
  name: query-3
  namespace: bar
spec:
  maxReplicaCount: 1
  minReplicaCount: 1
  scaleTargetRef:
    name: foo
  triggers:
  - authenticationRef:
      kind: ClusterTriggerAuthentication
      name: keda-trigger-auth-datadog-secret
    metadata:
      age: "60"
      metricUnavailableValue: "0"
      query: 'sum:trace.express.request.hits{service:foo,env:bar}.as_rate()/avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)'
      queryValue: "250"
    metricType: AverageValue
    type: datadog
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  labels:
    scaledobject.keda.sh/name: query-4
  name: query-4
  namespace: bar
spec:
  maxReplicaCount: 1
  minReplicaCount: 1
  scaleTargetRef:
    name: foo
  triggers:
  - authenticationRef:
      kind: ClusterTriggerAuthentication
      name: keda-trigger-auth-datadog-secret
    metadata:
      age: "60"
      metricUnavailableValue: "0"
      query: 'sum:trace.express.request.hits{service:foo,env:bar}/avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)'
      queryValue: "250"
    metricType: AverageValue
    type: datadog

Below is the HPA status for the above manifests. You can see that query-1/2/4 get the data without problems, but the metric of query-3 is always 0 because of the metricUnavailableValue config.

# kubectl get hpa keda-hpa-query-1 keda-hpa-query-2 keda-hpa-query-3 keda-hpa-query-4
NAME               REFERENCE                TARGETS               MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-query-1   Deployment/foo   193500m/250 (avg)     1         1         1          7m3s
keda-hpa-query-2   Deployment/foo   119m/250 (avg)        1         1         1          7m2s
keda-hpa-query-3   Deployment/foo   0/250 (avg)           1         1         1          7m2s
keda-hpa-query-4   Deployment/foo   62975001m/250 (avg)   1         1         1          7m1s
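
For context on why query-3 shows 0 rather than an error: metricUnavailableValue (the fill value) makes the scaler report a fixed fallback when the query doesn't return a usable number, and with the current behavior a null last data point counts as "unavailable". A minimal sketch of that pattern, with made-up names and not the actual KEDA code:

package main

import "fmt"

// point mirrors one entry of the Datadog "pointlist": [timestamp, value],
// where the value may be null in JSON (nil here).
type point struct {
    ts    float64
    value *float64
}

// lastValueOrFill returns the value of the last data point, or fillValue when
// that point is missing or null; this is the behavior that makes query-3 report 0.
func lastValueOrFill(points []point, fillValue float64) float64 {
    if len(points) == 0 || points[len(points)-1].value == nil {
        return fillValue
    }
    return *points[len(points)-1].value
}

func main() {
    v := 2144.17
    points := []point{{1670221180000, &v}, {1670221190000, nil}, {1670221200000, nil}}
    fmt.Println(lastValueOrFill(points, 0)) // prints 0: the trailing null triggers the fallback
}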

@JorTurFer
Member

No no, I just tried modifying the client to return the JSON you sent. I haven't tried with Datadog directly because I don't have anything working with Datadog. If you could share all the manifests needed to spin up a scenario that reproduces your issue, I can use our Datadog account to try it. (Sorry, I have zero expertise with Datadog and don't know how to generate the same scenario.)

I can install the Datadog agent on my cluster; then what do I need to deploy to generate those metrics?

@JorTurFer
Member

Below is the HPA status for the above manifests. You can see that query-1/2/4 get the data without problems, but the metric of query-3 is always 0 because of the metricUnavailableValue config.

I remember that the original issue was a panic while retrieving metrics, not just a fallback to 0. Is it the same issue this time? I mean, a fallback to 0 could simply mean, for example, that the time window is too small to retrieve any metrics. That's not the same behavior as a panic in the scaler.

@dogzzdogzz
Contributor Author

No no, I just tried modifying the client to return the JSON you sent. I haven't tried with Datadog directly because I don't have anything working with Datadog. If you could share all the manifests needed to spin up a scenario that reproduces your issue, I can use our Datadog account to try it. (Sorry, I have zero expertise with Datadog and don't know how to generate the same scenario.)

I can install the Datadog agent on my cluster; then what do I need to deploy to generate those metrics?

Oh OK. trace.express.request.hits requires installing a dd-trace APM library in some service because it's an HTTP request count metric; I think that might be too much trouble for you. Let me check whether I can reproduce it with some other, more common metric that already exists in Datadog.

I remember that the original issue was a panic while retrieving metrics, not just a fallback to 0. Is it the same issue this time? I mean, a fallback to 0 could simply mean, for example, that the time window is too small to retrieve any metrics. That's not the same behavior as a panic in the scaler.

I think it's not the same as issue 3448, because I already used rollup(10) to avoid null data points.

@dogzzdogzz
Contributor Author

@JorTurFer I updated my curl script to make sure the TO time is the current time, and finally found out that the last one or two data points in the response are always null. I'll create a PR to fix this issue.

TO=$(date +%s) && \
FROM=$(($TO - 60)) && \
curl -X GET "https://api.datadoghq.com/api/v1/query?from=$FROM&to=$TO&query=sum:trace.express.request.hits\{service:foo,env:bar\}.as_rate()/avg:kubernetes.cpu.requests\{service:foo,env:bar\}.rollup(10)"

Response

{
  "status": "ok",
  "resp_version": 1,
  "series": [
    {
      "end": 1670221209000,
      "attributes": {},
      "metric": "(trace.express.request.hits / kubernetes.cpu.requests)",
      "interval": 10,
      "tag_set": [],
      "start": 1670221160000,
      "length": 5,
      "query_index": 0,
      "aggr": "sum",
      "scope": "env:bar,service:foo",
      "pointlist": [
        [
          1670221160000,
          2276.6667175541334
        ],
        [
          1670221170000,
          2467.5000551529242
        ],
        [
          1670221180000,
          2144.1667145925276
        ],
        [
          1670221190000,
          null
        ],
        [
          1670221200000,
          null
        ]
      ],
      "expression": "(sum:trace.express.request.hits{env:bar,service:foo}.as_rate() / avg:kubernetes.cpu.requests{env:bar,service:foo}.rollup(10))",
      "unit": null,
      "display_name": "(trace.express.request.hits / kubernetes.cpu.requests)"
    }
  ],
  "to_date": 1670221212000,
  "query": "sum:trace.express.request.hits{service:foo,env:bar}.as_rate()/avg:kubernetes.cpu.requests{service:foo,env:bar}.rollup(10)",
  "message": "",
  "res_type": "time_series",
  "times": [],
  "from_date": 1670221152000,
  "group_by": [],
  "values": []
}
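
The exact fix isn't spelled out here, but a natural approach, sketched below purely as an assumption about the upcoming PR, is to walk the pointlist backwards and use the most recent non-null value instead of blindly taking the last entry:

package main

import (
    "errors"
    "fmt"
)

// latestNonNull walks a pointlist's values from the end and returns the most
// recent non-null one, so trailing nulls no longer force the fill value.
func latestNonNull(values []*float64) (float64, error) {
    for i := len(values) - 1; i >= 0; i-- {
        if values[i] != nil {
            return *values[i], nil
        }
    }
    return 0, errors.New("no non-null data points in response")
}

func main() {
    v1, v2 := 2467.50, 2144.17
    values := []*float64{&v1, &v2, nil, nil} // the last two points are null, as in the response above
    val, err := latestNonNull(values)
    if err != nil {
        panic(err)
    }
    fmt.Println(val) // 2144.17
}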

@dogzzdogzz dogzzdogzz changed the title [Datadog scaler] Failed to get the metrics for some specific datadog query [Datadog scaler] Scaler returns fillValue if the last data point of query is null Dec 5, 2022
@dogzzdogzz
Contributor Author

dogzzdogzz commented Dec 8, 2022

For anyone who encounters this issue as well: there are some additional issues and details about the Datadog API response mentioned in this comment:
#3954 (comment)
