Bug: Sending resource to Azure even if there are no configuration changes #2590

Closed
OleksandrBrodskiy opened this issue Nov 15, 2022 · 3 comments
Labels
bug 🪲 Something isn't working

Comments

@OleksandrBrodskiy

Version of Azure Service Operator
v2.0.0-beta.0 , v2.0.0-beta.3
AKS version - 1.24.6

Describe the bug
Every reconcile cycle ends by sending a BeginCreateOrUpdate request to Azure, even if the configuration has not changed since the previous reconcile cycle.
In our case, with a large number of resources managed by Azure Service Operator (>300), we got this error:
Number of write requests for subscription '***' exceeded the limit of '1200' for time interval '01:00:00'. Please try again after '303' seconds.
Changing azureSyncPeriod didn't resolve our issue, because on the next cycle all resources were updated at the same time.

Expected behavior
There shouldn't be any requests to Azure if the resource configuration has not changed.

Screenshots

  • 1115 14:58:12.587253 1 generic_controller.go:281] controllers/StorageAccountController "msg"="Reconcile invoked" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName" "generation"=1 "kind"="*v1beta20210401storage.StorageAccount" "resourceVersion"="103415751"

  • 1115 14:58:12.587312 1 azure_generic_arm_reconciler_instance.go:157] controllers/StorageAccountController "msg"="DetermineCreateOrUpdateAction" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName" "condition"="Condition [Ready], Status = "True", ObservedGeneration = 1, Severity = "", Reason = "Succeeded", Message = "", LastTransitionTime = "2022-11-15 14:30:15 +0000 UTC"" "pollerID"="" "resumeToken"=""

  • 1115 14:58:12.587340 1 azure_generic_arm_reconciler_instance.go:65] controllers/StorageAccountController "msg"="Reconciling resource" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName" "action"="BeginCreateOrUpdate"

  • 1115 14:58:12.587648 1 azure_generic_arm_reconciler_instance.go:288] controllers/StorageAccountController "msg"="About to send resource to Azure" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName"

  • 1115 14:58:13.672958 1 azure_generic_arm_reconciler_instance.go:306] controllers/StorageAccountController "msg"="Successfully sent resource to Azure" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName" "id"="/subscriptions/subscriptionId/resourceGroups/rgName/providers/Microsoft.Storage/storageAccounts/storageName"

  • 1115 14:58:13.673040 1 azure_generic_arm_reconciler_instance.go:378] controllers/StorageAccountController "msg"="Resource successfully created" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName" "resourceID"="/subscriptions/subscriptionId/resourceGroups/rgName/providers/Microsoft.Storage/storageAccounts/storageName"

  • 1115 14:58:13.673257 1 recorder.go:103] events "msg"="Normal" "message"="Successfully sent resource to Azure with ID "/subscriptions/subscriptionId/resourceGroups/rgName/providers/Microsoft.Storage/storageAccounts/storageName"" "object"={"kind":"StorageAccount","namespace":"namespaceName","name":"storageName","uid":"b4a1asd0-52d6-4795-a1cc-474e11b20ab4","apiVersion":"storage.azure.com/v1beta20210401storage","resourceVersion":"103415751"} "reason"="BeginCreateOrUpdate"

  • 1115 14:58:13.727399 1 secrets_retriever.go:51] controllers/StorageAccountController "msg"="Retrieving secrets from Azure" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName"

  • 1115 14:58:13.908235 1 secrets_retriever.go:57] controllers/StorageAccountController "msg"="Successfully retrieved secrets" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName" "SecretsToWrite"=1

@OleksandrBrodskiy OleksandrBrodskiy added the bug 🪲 Something isn't working label Nov 15, 2022
@theunrepentantgeek
Member

The behaviour you're seeing is by design, but we need to do something to prevent the throttling you're experiencing.

ASO is treating the custom resources in your cluster as the goal state and is issuing PUTs of those resources to Azure to ensure they haven't drifted from that goal state.

Much earlier in the life of ASO, we were only issuing updates to Azure when the Custom Resource was modified, which resulted in problems for some customers when changes were made in Azure and the resources drifted from the desired configuration. We also discovered that this was contrary to the expected behaviour of a Kubernetes Operator. We removed the use of a spec-hash in #2022.

Our desired behaviour is to be smarter about how we reconcile - there's discussion on this in #1491.
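
To make the trade-off between the two approaches concrete, here is a minimal sketch. It is an illustration only, not ASO's code: the helper names, the dictionary-based spec, and the field names `sku` and `accessTier` are all hypothetical.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Stable hash of the desired spec (hypothetical helper, not ASO code)."""
    canonical = json.dumps(spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_put_by_hash(spec: dict, last_applied_hash: str) -> bool:
    """Hash gating: write only when the custom resource itself changed."""
    return spec_hash(spec) != last_applied_hash


def needs_put_by_diff(spec: dict, azure_state: dict) -> bool:
    """Diff gating: compare against the live state (from a GET), write only on a mismatch."""
    return any(azure_state.get(key) != value for key, value in spec.items())


# The desired state is unchanged, but the resource was edited directly in Azure.
# Field names are made up for the example.
spec = {"sku": "Standard_LRS", "accessTier": "Hot"}
last_applied = spec_hash(spec)
drifted_azure_state = {"sku": "Standard_LRS", "accessTier": "Cool"}

print(needs_put_by_hash(spec, last_applied))         # hash gating does not see the drift
print(needs_put_by_diff(spec, drifted_azure_state))  # diff gating does
```

Hash gating saves writes but cannot detect changes made on the Azure side. Diff gating detects them at the cost of a read per cycle.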

Changing azureSyncPeriod didn't resolve our issue, because on the next cycle all resources were updated at the same time.

I'll check in with the rest of the ASO team - I would have expected the reconciles to be somewhat spread out in time.

@matthchr
Member

There shouldn't be any requests to Azure if the resource configuration has not changed.

As @theunrepentantgeek mentioned, it is by design that ASO issues requests to Azure periodically even when in steady state. This is to correct drift on the Azure side if changes have been made there without the operator's knowledge. As you correctly determined, the azureSyncPeriod can be used to configure how often this happens.

If you really want ASO to do nothing while in steady state, you can set the syncPeriod really high (100 years). This will effectively accomplish that, even if it's a bit of a hack.

We also just recently changed the default syncPeriod from 15m to 1h (#2578) due to throttling concerns. This will be included in the beta.4 release that is upcoming.
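
As a rough check on the numbers in this thread, the sketch below estimates hourly write volume. It assumes exactly one PUT per managed resource per sync period and nothing else writing to the subscription; the function name is illustrative.

```python
def writes_per_hour(resource_count: int, sync_period_minutes: float) -> float:
    """Estimate steady-state PUTs per hour, assuming one PUT per resource per sync period."""
    if sync_period_minutes <= 0:
        raise ValueError("sync_period_minutes must be positive")
    return resource_count * 60.0 / sync_period_minutes


# 300 resources at the old 15m default versus the new 1h default.
for period in (15, 60):
    print(f"{period}m sync period: {writes_per_hour(300, period):.0f} writes/hour")
```

Under that assumption, 300 resources at a 15m period produce 1200 writes per hour, which is the limit quoted in the error message. At 1h the figure drops to 300.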

It is unlikely that ASO will ever recommend issuing no requests to Azure while you're in steady state. With that said, we are definitely aware that throttling is a problem with this pattern and in general.

The changes I see requested here are:

  1. Move towards #1491 ("Reconcile should perform a diff with Azure rather than relying on a spec hash"), which would turn these requests from PUTs into GETs. GETs have higher throttling limits (often significantly higher).
  2. Increase the jitter that the sync action has so that it doesn't have as much chance of happening around the same time. There is already jitter but it's only 10% so a 1h syncPeriod will result in a request to Azure in 54 to 66 minutes, which isn't all that big of a time window. I can increase this to 25%, which moves 1h to a 45m to 75m range, which is significantly bigger.
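
The jitter ranges quoted in item 2 can be reproduced with a short calculation. This is a sketch that assumes the jitter is symmetric around the sync period; the function name is illustrative, and how ASO actually draws the jitter is not shown here.

```python
def jitter_window(sync_period_minutes: float, jitter_fraction: float) -> tuple[float, float]:
    """Return the (earliest, latest) resync time in minutes for a symmetric jitter."""
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    spread = sync_period_minutes * jitter_fraction
    return sync_period_minutes - spread, sync_period_minutes + spread


for fraction in (0.10, 0.25):
    low, high = jitter_window(60, fraction)
    print(f"jitter {fraction:.0%}: {low:.0f}m to {high:.0f}m (window {high - low:.0f}m)")
```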

What you can do in the meantime:
Since there is already some jitter, you should be able to basically eliminate throttling by setting the syncPeriod to something long like 1 day. That means one request to Azure per resource every ~22 to 26h, which should spread your 300 resources across a ~5h time range for an average of 60 requests / hr, far below the 1200 limit.
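
The estimate above can be sketched as follows. The assumptions are that all resources start in lockstep, that the jitter is symmetric, and that requests spread evenly across the jitter window; the function name is illustrative.

```python
def peak_writes_in_one_hour(resource_count: int, sync_period_hours: float, jitter_fraction: float) -> float:
    """Estimate writes landing in the busiest hour of the first resync after a lockstep start."""
    window_hours = 2 * jitter_fraction * sync_period_hours
    if window_hours <= 1:
        # The whole jitter window fits inside a single hour, so every resource lands in it.
        return float(resource_count)
    return resource_count / window_hours


# 300 resources with the existing 10% jitter, at a 1h and a 24h sync period.
for period_hours in (1, 24):
    estimate = peak_writes_in_one_hour(300, period_hours, 0.10)
    print(f"{period_hours}h sync period: about {estimate:.1f} writes in the busiest hour")
```

With an exact 10% jitter on 24h the window is 4.8h, which gives about 62.5 writes per hour, in line with the rounded ~5h and 60 requests / hr figures.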

The other thing to check for is a resource stuck in a bad state triggering much faster requests (ASO logs or metrics can help you find this). We had another customer report something like this, and an improvement is coming in beta.4 as well (see #2575)
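
One way to look for such a resource in the logs is to count how often each one is sent to Azure. The sketch below assumes log lines shaped like the excerpt in this issue, with `"msg"`, `"name"` and `"namespace"` key/value pairs. The function name is illustrative, and the sample lines are shortened copies of that excerpt.

```python
import re
from collections import Counter

# Message text taken from the log excerpt in this issue.
SEND_MARKER = '"msg"="About to send resource to Azure"'
NAME_RE = re.compile(r'"name"="([^"]*)"')
NAMESPACE_RE = re.compile(r'"namespace"="([^"]*)"')


def count_sends(log_lines):
    """Count 'About to send resource to Azure' entries per (namespace, name)."""
    counts = Counter()
    for line in log_lines:
        if SEND_MARKER not in line:
            continue
        name = NAME_RE.search(line)
        namespace = NAMESPACE_RE.search(line)
        if name and namespace:
            counts[(namespace.group(1), name.group(1))] += 1
    return counts


sample = [
    'controllers/StorageAccountController "msg"="Reconcile invoked" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName"',
    'controllers/StorageAccountController "msg"="About to send resource to Azure" "azureName"="storageName" "name"="storageName" "namespace"="namespaceName"',
]
for (namespace, name), sends in count_sends(sample).most_common():
    print(f"{namespace}/{name}: {sends} send(s)")
```

A resource whose count is far above what the sync period would predict for the time span covered by the logs is a candidate for closer inspection.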

@matthchr
Member

Closing this as the work is already being tracked by #1491 and #2597. Feel free to reply here and reopen or reply on either of those issues if you have more thoughts on this topic.
