gcp-pubsub throwing "could not find stackdriver metric" #5429

Closed
ddlenz opened this issue Jan 23, 2024 · 14 comments · Fixed by #5452
Labels
bug Something isn't working


@ddlenz

ddlenz commented Jan 23, 2024

Report

After updating to 2.13.0, gcp_pub_sub_scaler repeatedly throws "error getting metric" and scale_handler throws "error getting scale decision" with "could not find stackdriver metric with query fetch pubsub_subscription", even though the subscription has unacked messages, and the workload fails to scale.

Expected Behavior

keda scales application from zero

Actual Behavior

keda fails to scale application from zero

Steps to Reproduce the Problem

  1. create gcp-pubsub scaler
  2. publish a message to trigger the scaler
  3. check unacked messages and logs

Logs from KEDA operator

No response

KEDA Version

2.13.0

Kubernetes Version

Other

Platform

Google Cloud

Scaler Details

gcp pubsub

Anything else?

Kubernetes version 1.28.3-gke.1203001

@ddlenz ddlenz added the bug Something isn't working label Jan 23, 2024
@JorTurFer
Member

JorTurFer commented Jan 24, 2024

Hello,
Does it work on previous versions? Could you share your ScaledObject?

Could you share KEDA operator logs as well?

Steps to Reproduce the Problem

  1. create gcp-pubsub scaler
  2. publish a message to trigger the scaler
  3. check unacked messages and logs

This case is already covered by e2e tests and it works. One thing that can happen is that if you don't have messages, you won't get a metric, because the API itself responds with an error (which, AFAIK, is normal for Pub/Sub monitoring when there hasn't been any activity on the queue).

There is a change that could be related, but I don't think it's the cause, as the e2e tests still pass and the change kept the default behavior (and that's why I'm asking for more info xD)

@eremeevfd

First, thank you for an incredible and very useful product!

Unfortunately, we've encountered the same issue as well.
Logs from KEDA:

2024-01-25T10:20:42Z    ERROR    scale_handler    error getting scale decision    {"scaledObject.Namespace": "algorithms-selfie", "scaledObject.Name": "translucency", "scaler": "pubsubScaler", "error": "could not find stackdriver metric with query fetch pubsub_subscription | metric 'pubsub.googleapis.com/subscription/num_undelivered_messages' | filter (resource.project_id == 'sasuke-core-dev' && resource.subscription_id == 'selfie_v2.translucency-2.0.0') | within 1m"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
    /workspace/pkg/scaling/scale_handler.go:764
 github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
    /workspace/pkg/scaling/scale_handler.go:628

However when I try to fetch the same metric in Google Metrics Explorer I get some results:
[Screenshot: Google Metrics Explorer results for the query, 2024-01-25 11:23]
Here we can see that there were some messages earlier, but the result is empty now. Should that really be returned as an error, though? Shouldn't it be zero or some null value?
Could it also be a problem with TriggerAuthentication, maybe? Or would that emit a different error about permissions?

@FrancoisPoinsot

I am having the exact same issue.
Same KEDA version.
Kubernetes 1.26.
Similar error log.
Also with the gcp-pubsub scaler.
I can also run the query successfully in Google Metrics Explorer.

I had to roll back to 2.12.1 because some workloads were not scaling up.
That is what we detected originally.

@JorTurFer
Member

JorTurFer commented Jan 30, 2024

So, does it work in KEDA v2.12.1 and not in KEDA v2.13.0?
By chance, do you have steps I can follow to replicate the issue? Something like: push a message somehow, check KEDA somehow, etc.
I have zero experience with GCP, so although I've checked the changes, I don't see anything significant (at least yet), and having a way to reproduce the issue in our account would be great for comparing both versions. I mean, reproduction steps that work on v2.12.1 and don't work on v2.13.0.

@ekaputra07

Hi, first of all, thanks for this great project!

But I'm facing the same issue here. I'm using:

KEDA v2.13.0 / GKE Autopilot / Pub/Sub

My ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: some-name
spec:
  scaleTargetRef:
    name: some-name
  triggers:
    - type: gcp-pubsub
      authenticationRef:
        name: trigger-authentication-dev
        kind: ClusterTriggerAuthentication
      metadata:
        mode: NumUndeliveredMessages
        value: "5"
        activationValue: "0"
        subscriptionName: projects/my-project/subscriptions/my-sub

The error:

2024-01-31T04:53:00Z	ERROR	gcp_pub_sub_scaler	error getting metric	{"type": "ScaledObject", "namespace": "default", "name": "some-name", "metricType": "pubsub.googleapis.com/subscription/num_undelivered_messages", "error": "could not find stackdriver metric with query fetch pubsub_subscription | metric 'pubsub.googleapis.com/subscription/num_undelivered_messages' | filter (resource.project_id == 'my-project' && resource.subscription_id == 'my-sub') | within 1m"}

I copy-pasted the query from the error message and ran it in GCP's Metrics Explorer:
[Screenshot: Metrics Explorer results for the query]

Things that I noticed are:

  • using within 1m, the query sometimes returns a result, but most of the time it doesn't
  • using a larger time range, for example within 5m, it always works (returns a result)

The documentation for this metric says:

subscription/num_undelivered_messages
Number of unacknowledged messages (a.k.a. backlog messages) in a subscription. Sampled every 60 seconds. After sampling, data is not visible for up to 120 seconds.

These might not be related, but I'm trying to provide as much data as possible in the hope that it helps debug the situation.


And based on the scaler code, it looks like we treat it as an error if Stackdriver doesn't return a value:

if err == iterator.Done {
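
For context, here is a minimal standalone sketch of that pattern (the function name, variable names, and query string are assumptions for illustration, not the actual KEDA implementation): an MQL query whose result iterator is exhausted immediately gets reported as an error instead of a zero value.

package main

import (
	"context"
	"fmt"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/api/iterator"
)

// numUndeliveredMessages runs an MQL query like the one in the error message and
// returns the first data point, or an error when the window contains no data.
func numUndeliveredMessages(ctx context.Context, projectID, query string) (int64, error) {
	client, err := monitoring.NewQueryClient(ctx)
	if err != nil {
		return 0, err
	}
	defer client.Close()

	it := client.QueryTimeSeries(ctx, &monitoringpb.QueryTimeSeriesRequest{
		Name:  "projects/" + projectID,
		Query: query, // e.g. "fetch pubsub_subscription | metric '...' | within 1m"
	})
	series, err := it.Next()
	if err == iterator.Done {
		// An empty result is surfaced as an error rather than a zero value,
		// which is what shows up as "could not find stackdriver metric".
		return 0, fmt.Errorf("could not find stackdriver metric with query %s", query)
	}
	if err != nil {
		return 0, err
	}
	points := series.GetPointData()
	if len(points) == 0 {
		return 0, fmt.Errorf("no data points in response for query %s", query)
	}
	return points[0].GetValues()[0].GetInt64Value(), nil
}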

@FrancoisPoinsot

FrancoisPoinsot commented Jan 31, 2024

Some more context about what happened.

I have 8 gcp-pubsub ScaledObjects:
5 with minReplicaCount: 0.
3 with minReplicaCount: 1.

When we upgraded KEDA to 2.13.0, all 5 deployments targeted by a ScaledObject with minReplicaCount: 0 started having scaling issues.
They were scaled down to 0, no matter how many pods existed.

Those with minReplicaCount: 1 seem unaffected.
So I am wondering if the problem is specifically related to the activation phase.

I am trying to get a reproduction scenario.

Here is a basic manifest:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test-francois
  namespace: test-francois
spec:
  maxReplicaCount: 5
  minReplicaCount: 0
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-francois
  triggers:
  - authenticationRef:
      kind: ClusterTriggerAuthentication
      name: keda-clustertrigger-auth-gcp-credentials
    metadata:
      mode: SubscriptionSize
      subscriptionName: test-francois-sub
      value: "4"
    type: gcp-pubsub

You need to publish to the topic in question manually; a small sketch of doing that with the Go Pub/Sub client is below. There's no need to ack any message; just use the value to scale any random deployment up/down.
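
A minimal publisher sketch (the project ID "my-project" and the topic name "test-francois" are placeholders for this reproduction setup, not values taken from the thread):

package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// Connect to the project that owns the topic backing test-francois-sub.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Publish one message so the subscription backlog becomes non-zero;
	// nothing needs to ack it for num_undelivered_messages to grow.
	topic := client.Topic("test-francois")
	res := topic.Publish(ctx, &pubsub.Message{Data: []byte("trigger scaling")})
	id, err := res.Get(ctx) // blocks until the server acknowledges the publish
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("published message", id)
}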

Using this, I can reproduce the error log in keda-operator.
I also get a steady increase in the keda_scaler_errors metric.
And watching the generated HPA, this is what I see:

NAME                     REFERENCE                  TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-test-francois   Deployment/test-francois   2/4 (avg)   1         5         1          117m
keda-hpa-test-francois   Deployment/test-francois   <unknown>/4 (avg)   1         5         1          117m

Roughly 1/3 of the entries show the target as <unknown>.

However, this setup is not enough to reproduce the issue above: the deployment is not being scaled down to 0.
But I wonder if that is just a matter of how frequently those errors show up.

@FrancoisPoinsot

FrancoisPoinsot commented Jan 31, 2024

Looking at https://github.com/kedacore/keda/pull/5246/files, which was merged for 2.13.0, I see the GetMetrics call has been replaced with QueryMetrics:

https://github.com/kedacore/keda/pull/5246/files#diff-aaa03b99f93c680bd727f6f0a3e9d932c34344ad25b3a254f9a56178c853fe3bR233

And getMetrics queried 2 minutes back in time, instead of the 1m window that is used now:

startTime := time.Now().UTC().Add(time.Minute * -2)
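
To make the difference concrete, here is a rough sketch of the two time windows (simplified and with assumed names, not the exact code from either release):

package main

import (
	"time"

	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/protobuf/types/known/timestamppb"
)

// timeWindows contrasts the lookback used before and after the change.
func timeWindows() (*monitoringpb.TimeInterval, string) {
	// v2.12.x (GetMetrics): an explicit interval reaching 2 minutes back, wide
	// enough to cover Pub/Sub's 60s sampling plus the up-to-120s ingestion delay.
	endTime := time.Now().UTC()
	startTime := endTime.Add(time.Minute * -2)
	oldInterval := &monitoringpb.TimeInterval{
		StartTime: timestamppb.New(startTime),
		EndTime:   timestamppb.New(endTime),
	}

	// v2.13.0 (QueryMetrics): the window lives inside the MQL string itself,
	// and "within 1m" is often too narrow for a metric that can lag by ~2 minutes.
	newQuery := "fetch pubsub_subscription" +
		" | metric 'pubsub.googleapis.com/subscription/num_undelivered_messages'" +
		" | within 1m"
	return oldInterval, newQuery
}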

@JorTurFer
Member

JorTurFer commented Feb 1, 2024

Nice research! I was thinking that maybe we changed a default behavior by mistake, and it looks like we did (and we have to fix it).

I'm thinking about adding the aggregation window as an optional parameter too (for the next version).

@JorTurFer
Member

@FrancoisPoinsot, I've reverted the change to the default time horizon in this PR.

The generated image with that change is ghcr.io/kedacore/keda-test:pr-5452-c5cf46759c5691b29bb45c6bbb60e3be10cd9f7a. Would you be willing to test it?

@FrancoisPoinsot

I confirm that with ghcr.io/kedacore/keda-test:pr-5452-c5cf46759c5691b29bb45c6bbb60e3be10cd9f7a the error log is gone and keda_scaler_errors doesn't show any errors.
The HPA also behaves as expected.
That looks like a fix.

@JorTurFer
Member

Do you see any increase in goroutines now?

@FrancoisPoinsot

The goroutine count looks stable too.

@JorTurFer
Member

Thanks for the feedback ❤️

I was probably right: the goroutine issue was the connection not being closed properly. Now that the scaler isn't being regenerated on each check, the issue is mitigated. I've also included proper closing of the connection as part of the PR: 4084ee0
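
For illustration, a minimal sketch of that kind of cleanup (the type and field names are assumptions, not the actual KEDA code): the scaler's Close releases the underlying Stackdriver query client so its goroutines and gRPC connection go away when the scaler is refreshed.

package main

import (
	"context"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
)

// Assumed shape for illustration only; not the real KEDA scaler type.
type pubsubScaler struct {
	metricsClient *monitoring.QueryClient
}

// Close releases the metrics client held by the scaler.
func (s *pubsubScaler) Close(ctx context.Context) error {
	if s.metricsClient != nil {
		err := s.metricsClient.Close()
		s.metricsClient = nil
		return err
	}
	return nil
}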

@JoelDimbernat

For anyone still encountering this error, ensure that your service account is granted the role roles/monitoring.viewer on the project. It's necessary to access pubsub.googleapis.com/subscription/num_undelivered_messages and I don't think it's documented anywhere.
