Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure Service Bus Scaler Sends GetAccess Requests on every getQueueLength message #4262

Closed
tshaiman opened this issue Feb 21, 2023 · 21 comments · Fixed by #4273
Closed

Azure Service Bus Scaler Sends GetAccess Requests on every getQueueLength message #4262

tshaiman opened this issue Feb 21, 2023 · 21 comments · Fixed by #4273
Assignees
Labels
bug Something isn't working

Comments

@tshaiman
Copy link

tshaiman commented Feb 21, 2023

Keda Service Bus Scaler keeps asking for Arm GetAccess on Service Bus -> no reused of AdminClient

we saw a significant increase of getAcccess On Azure ARM Endpoint, aiming to authenticate that the request has access to the service bus.

  1. This happens when all the pods are at scale zero
  2. we have 16 queues , with 8 environments , with polling interval of 30 seconds => that is around 256 calls Per minute
  3. Looking at the Keda Service bus Scaler code we see the following

azure_servicebus_scaler.go

func (s *azureServiceBusScaler) getAzureServiceBusLength(ctx context.Context) (int64, error) {
	// get adminClient
	adminClient, err := s.getServiceBusAdminClient()
	if err != nil {
		return -1, err
	}
	// switch case for queue vs topic here
	switch s.metadata.entityType {
	case queue:
		return getQueueLength(ctx, adminClient, s.metadata)
	case subscription:
		return getSubscriptionLength(ctx, adminClient, s.metadata)
	default:
		return -1, fmt.Errorf("no entity type")
	}
}

a servicebusAdminClient is created but it is not assigned to any varaible on "s" (azureServiceBusScaler)

func (s *azureServiceBusScaler) getServiceBusAdminClient() (*admin.Client, error) {
	if s.client != nil {
		return s.client, nil
	}
  .......//Creating the client
}

so each time a check is made to see if s.client is initialized , it is always be null , hence new client will be created over and over again.

Expected Behavior

  • The admin.Client needs to be stored in the azureServiceBusScaler object and not be created every time

Actual Behavior

  • The admin.Client is created on every "getQueueLength" request hence cause Azure ARM calls to be throttled with "429-Too many requests "

  • when we query Azure ARM Endpoint for investigation we found the following :

HttpIncomingRequests
| where TIMESTAMP > ago(1d)| where subscriptionId == "24237b72-8546-4da5-b204-8c3cb76dd930"
| summarize count() by operationName
| order by count_ desc

POST/SUBSCRIPTIONS/RESOURCEGROUPS/PROVIDERS/MICROSOFT.SERVICEBUS/NAMESPACES/QUEUES/PROVIDERS/MICROSOFT.AUTHORIZATION/CHECKACCESS - 709361 calls

so CheckAccess API has 709K calls per days , even when no pods are running.

Steps to Reproduce the Problem

  1. run Keda with at least 160 queues and use polling interval of 30 seconds/ or use less queues with low polling interval
  2. Goto Azure Portal and view the number of Requests on the Service Bus Queue :

image 1 : queue length is empty
image

image 2: number of calls is 250-300 per minute.
image

  1. ensure all calls are
    POST/SUBSCRIPTIONS/RESOURCEGROUPS/PROVIDERS/MICROSOFT.SERVICEBUS/NAMESPACES/QUEUES/PROVIDERS/MICROSOFT.AUTHORIZATION/CHECKACCESS

Logs from KEDA operator

no errros

KEDA Version

2.9.3

Kubernetes Version

1.24

Platform

Microsoft Azure

Scaler Details

Azure Service Bus

Anything else?

  • would like to submit a patch after this is confirmed to be a bug
@tshaiman tshaiman added the bug Something isn't working label Feb 21, 2023
@JorTurFer
Copy link
Member

JorTurFer commented Feb 23, 2023

Hello!
Nice catch! KEDA asks the queues even if there isn't any pod, but definitively the client should be reused in the scaler

Are you willing to contribute with the fix?

I think that a small refactor of this function could be enough.

Something like this could be enough:

func (s *azureServiceBusScaler) getServiceBusAdminClient() (*admin.Client, error) {
	if s.client != nil {
		return s.client, nil
	}
	var err error
	var client *admin.Client
	switch s.podIdentity.Provider {
	case "", kedav1alpha1.PodIdentityProviderNone:
		client, err = admin.NewClientFromConnectionString(s.metadata.connection, nil)
	case kedav1alpha1.PodIdentityProviderAzure, kedav1alpha1.PodIdentityProviderAzureWorkload:
		creds, chainedErr := azure.NewChainedCredential(s.podIdentity.IdentityID, s.podIdentity.Provider)
		if chainedErr != nil {
			return nil, err
		}
		client, err = admin.NewClient(s.metadata.fullyQualifiedNamespace, creds, nil)
	default:
		err = fmt.Errorf("incorrect podIdentity type")
	}

	s.client = client
	return client, err
}

@zroubalik
Copy link
Member

Great find!

@xoanmm
Copy link
Contributor

xoanmm commented Feb 23, 2023

Hey,

I want to tackle this issue

@tshaiman
Copy link
Author

oh man you are quick i was just about to submit a PR 😂

you snooze you loose

@tshaiman
Copy link
Author

tshaiman commented Feb 25, 2023

@JorTurFer / @tomkerkhove / @xoanmm : I'm afriad that the fix in #4273 is now causing another issue.

I took the fix for a test drive ( build custom docker image) and after 1 hour I started getting Pod Identity Exception
I think this is because Azure Tokens are valid for 1 hour and the code does not take this into account but keep using the cached admin client.

peeking at the code of Identity in Keda and with knowledge on how Azure Identity library work I got to say this entire Identity issue needs to be discussed.

having said that the fix is not working for a pod that is longed lived .

@JorTurFer
Copy link
Member

What error do you see? In theory, the service bus SDK should call to GetToken when the token has expired. I mean, maybe there is other error, but I think that this implementation is correct as we should cache the client instead of creating a new one each time.
KEDA also recreates the scaler when there is any error getting the metrics, so even if the client is cached in the scaler, the whole scaler is regenerated on any error.
I think that the error should be in other place.

@tshaiman
Copy link
Author

I see the same-old ,most common error that keda does not have a valid token to use.
I do agree that we should cache the client , but I was looking how tokens are being asked and used in Keda , which made me rethink about the entire issue we see here ( and in the infamous Workload Identity bugs we saw few months ago )
I started a discussion here :
#4279

I agree that the SDK needs to call GetToken behind the scenes , but as I opened the custom implementation/wrapper of Identity in keda , as described in the discussion , I'm not sure this will eventually happen since the usage of the well defined contract against Azure .Identity library is not met in this case.

@JorTurFer
Copy link
Member

JorTurFer commented Feb 25, 2023

I'm deploying that change into my cluster right now, let's wait until tomorrow to check how it looks.
What are you using? Workload Identity or AAD-Pod-Identity?

@JorTurFer JorTurFer reopened this Feb 25, 2023
@github-project-automation github-project-automation bot moved this from Ready To Ship to Proposed in Roadmap - KEDA Core Feb 25, 2023
@github-project-automation github-project-automation bot moved this from Proposed to Ready To Ship in Roadmap - KEDA Core Feb 25, 2023
@JorTurFer JorTurFer reopened this Feb 25, 2023
@github-project-automation github-project-automation bot moved this from Ready To Ship to Proposed in Roadmap - KEDA Core Feb 25, 2023
@tshaiman
Copy link
Author

pod-identity , as workload identity had a bug that related to their webhook few months ago

@JorTurFer
Copy link
Member

I don't see any error for the moment (76 min):
image

I'll keep both apps (aad-pod-identity and wi) deployed some days to check if there is any error

@JorTurFer
Copy link
Member

Nothing yet (125min), Let's see after 12-24 hours
image

@tshaiman
Copy link
Author

what i did in my test was to replace only the fixed code ( saving admin client ) into v2.9.3
that could be insufficient with more fixes merged to master

@tshaiman
Copy link
Author

tshaiman commented Feb 26, 2023 via email

@JorTurFer
Copy link
Member

JorTurFer commented Feb 26, 2023

is this Lens ?

yes, It's openlens (an open source fork of lens).
It isn't perfect and sometime I still need kubectl, but for majority of the case works realy nice

@JorTurFer
Copy link
Member

Same behavior after 11 hours, I guess that we need to wait until 12/24 hours to ensure that the token is regenerated.
image

BTW, do you see any error in the aad-pod-identity pods? as it's an authentication proxy, maybe there is any error there that givesd more info about the problems

@tshaiman
Copy link
Author

is it 2.9.3 with the fix , or is it s2.9.4 candidate.
what i did is to apply the fix on this bug to 2/.9.3 which may be incorrect.
can I get an image tag to test it ?

@JorTurFer
Copy link
Member

I am using 'main' tag because it's the only that contains the fix for the moment ('main' tag contains the next version changes)

Be careful trying that image because it contains the certificarte management (a new feature introduced to improve the security), so you can't just update the tag, there are required changes in the RBAC to allow this management. Other option is to manage the certificates from your side, but that's more complicated. Next version docs explains it https://keda.sh/docs/2.10/operate/security/#use-your-own-tls-certificates
And we will release a blog post explaining how to do it with cert-manager after release

@tshaiman
Copy link
Author

ok that is too much for now indeed
i will just save the auth client on my 2.9.3 tag and rebuild with ‘make container-build׳

i don’t want you to spend time on this if after 12 hours it works - most chances i didn’t build correctly .

will update shortly

@JorTurFer
Copy link
Member

JorTurFer commented Feb 26, 2023

21H without any error:
image

Tomorrow morning I'll check if there is any error after 24h. I guess that if tomorrow this still works, the access token renewal is done correctly and I can close the issue again 😄
I created 2 ScaledObject, one for add-pod-identity and other one for aad-wi so both are covered with this test

@tshaiman
Copy link
Author

@JorTurFer : conducted my tests as well and confirmed to be working correctly.
the problem was running against "workload-identity" which is still not fixed completely by workload-identity team with the webhook and its policy .

thanks for all your effots and clarifications

@github-project-automation github-project-automation bot moved this from Proposed to Ready To Ship in Roadmap - KEDA Core Feb 26, 2023
@JorTurFer
Copy link
Member

Nice!

@JorTurFer JorTurFer moved this from Ready To Ship to Done in Roadmap - KEDA Core Mar 13, 2023
xoanmm pushed a commit to xoanmm/keda that referenced this issue Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants