
Reconcile should perform a diff with Azure #2811

Open
theunrepentantgeek opened this issue Mar 27, 2023 · 14 comments

Comments

@theunrepentantgeek
Member

theunrepentantgeek commented Mar 27, 2023

At the moment we are doing a unilateral PUT of each resource when we reconcile; this works but has some drawbacks.

We should diff the current state of the resource with the spec and only do a PUT where required (see #2600 for potential design).

Follow-up to #1491 as much of that has been implemented already.

@theunrepentantgeek
Member Author

Copied from earlier comment:

Some examples of where this is done in other operators:

CAPZ
Terraform

@matthchr
Member

Another thing I recently realized: Diffing will cause issues with Azure Policy, see: kubernetes-sigs/cluster-api-provider-azure#3009 (comment)

@matthchr
Member

matthchr commented Mar 29, 2023

Another idea that just occurred to me, documenting it here in case it's actually a good one (no guarantees): we could use a heuristic that reduces PUT load without requiring the full investment, and the problems, that come with client-side diffing.

A good example of such a heuristic would be:

  1. After the PUT, take the shape of the returned Azure resource (the Status), hash it and store it in an annotation
  2. Before any subsequent PUT, issue a GET and compare the hash vs the hash we have.

We could have manually crafted customizations per resource, or possibly automatically generate customizations from the Swagger, to ignore fields like read-only timestamps that are constantly changing and would always cause the payloads to mismatch. That said, I don't think there are many constantly changing fields in the APIs I have looked at.
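A minimal Go sketch of steps 1 and 2 above, assuming hypothetical field names and an illustrative annotation scheme (this is not ASO's actual implementation): strip the volatile fields, hash the canonical JSON, and compare against the stored hash before issuing a PUT.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// hashStatus canonicalizes an Azure resource status and returns a hex
// SHA-256 digest, skipping fields (e.g. constantly-changing timestamps)
// that would otherwise always cause a mismatch.
func hashStatus(status map[string]any, ignore map[string]bool) (string, error) {
	filtered := make(map[string]any, len(status))
	for k, v := range status {
		if !ignore[k] {
			filtered[k] = v
		}
	}
	// json.Marshal sorts map keys, so the encoding is deterministic.
	b, err := json.Marshal(filtered)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	ignore := map[string]bool{"lastModifiedAt": true} // illustrative volatile field

	before := map[string]any{"sku": "Standard", "lastModifiedAt": "2023-03-29T01:00:00Z"}
	after := map[string]any{"sku": "Standard", "lastModifiedAt": "2023-03-29T02:00:00Z"}

	h1, _ := hashStatus(before, ignore)
	h2, _ := hashStatus(after, ignore)
	// Only the ignored timestamp changed, so the hashes match and the
	// operator could skip the PUT.
	fmt.Println(h1 == h2) // true
}
```

The per-resource customizations described above would populate the `ignore` set, either hand-written or generated from the Swagger.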

This approach has a lot of advantages:

  1. It works with Azure policy.
  2. It's (relatively) easy to implement.
  3. It doesn't require perfect upstream Swagger specifications that correctly annotate every readOnly field.

If we like this general idea but don't love the hash (it's hard to show exactly what differed with a hash), we could instead store the raw status JSON, compressed or uncompressed, in an annotation. (I say compressed because sticking the entire status into an annotation without shrinking it somehow might cause problems.)
I do think a hash would work, though: we could log the input value, so we don't need a diff constructible from the annotation; users (or we) can recover it manually from the logs.

@matthchr
Member

I believe that the above is actually what Terraform does: from what I can tell, they issue a GET after a PUT and then use the result of that GET as the state file.

They only (manually) extract certain fields from the GET response: presumably fields that they know are writable, thus ignoring all of the readonly fields.
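That Terraform-style extraction can be sketched as follows (the field names here are hypothetical; real Azure responses vary per resource type): keep only an allowlist of known-writable fields from the GET response and discard everything read-only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractWritable keeps only the fields known to be writable, dropping
// read-only fields (ids, timestamps, provisioning state) from a GET response.
func extractWritable(resp map[string]any, writable []string) map[string]any {
	out := make(map[string]any)
	for _, k := range writable {
		if v, ok := resp[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Hypothetical GET response for some Azure resource.
	resp := map[string]any{
		"location":          "westus2",
		"sku":               "Standard",
		"provisioningState": "Succeeded",            // read-only
		"createdAt":         "2023-03-29T01:00:00Z", // read-only
	}
	state := extractWritable(resp, []string{"location", "sku"})
	b, _ := json.Marshal(state)
	fmt.Println(string(b)) // {"location":"westus2","sku":"Standard"}
}
```

The cost of this approach is the manual allowlist per resource, which is exactly the kind of per-resource customization discussed earlier in this thread.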

@matthchr matthchr added this to the v2.1.0 milestone Apr 3, 2023
@theunrepentantgeek
Member Author

We're still keen on doing this.

@petrk-75

petrk-75 commented Aug 10, 2023

Is there any estimate for when ASO will stop using PUT for reconciliation and use the diff described above instead? Is there any guarantee it will make it into 2.4.0? Any rough schedule?

@kmitawojciech

We're still keen on doing this.

Any specific plans when we could expect this feature?

@matthchr
Member

We're still keen on doing this.

Any specific plans when we could expect this feature?

It's not currently that high on the list, mostly because it's quite difficult to implement generically. Can you expand on why you want/need it? AFAIK there are two key reasons it's interesting over just doing PUT:

  1. To reduce PUT load on the subscription and avoid throttling. We've already made improvements to how often we PUT, though, and since then I haven't heard of many users running into throttling concerns.
  2. To avoid messing with resources whose PUT isn't idempotent, unless there is a change to be made. AFAIK very few resources fall into this category.

Does your desire fall into one of the above? Or is there another category of reason we're missing?

@petrk-75

petrk-75 commented Aug 10, 2023

We are working on a solution that needs to be scalable. As soon as users/tenants create new services, we need to create new resources. The number of expected resources could be higher than the current ASO throughput, and a limit of 300/1200 resources per subscription may not be sufficient. So there's neither an estimate nor a guarantee it will be part of 2.4.0; do I understand that right?

@matthchr
Member

Ah, yes, our FAQ is actually slightly out of date here. It says:

Azure subscriptions have a default limit of ~1200 PUTs

This is not strictly true. The limit is 1200 PUTs/hr per HTTP connection (technically per frontend instance, but HTTP connections are pinned to an instance, so it basically boils down to that). We now configure an HTTP client that uses multiple connections (see #2685), so the limit is higher than it used to be.

After we made the above changes, we haven't seen users actually hit throttling in practice. That doesn't mean there isn't a limit (there is), but it's probably an order of magnitude higher than 1200/hr. There are also limits on GETs, so there will always be a maximum number of resources you can manage in a single subscription.

Don't necessarily let that FAQ entry scare you away; I've made a note to update it. In reality, between the improvements we've made and the ability to tune the azureSyncInterval, you should have all the tools you need to scale to a large number of resources today, even without this diffing logic.

That's not to say we aren't going to do this diffing stuff, but the approach we have now is actually pretty good from a "number of resources" perspective and so we're waiting for more signal to bump the importance of this up.

@matthchr
Member

matthchr commented Dec 4, 2023

This is still something we're interested in, although given our throttling changes it doesn't seem as critical.

@theunrepentantgeek theunrepentantgeek modified the milestones: v2.6.0, v2.7.0 Dec 11, 2023
@matthchr matthchr removed this from the v2.7.0 milestone Feb 22, 2024
@matthchr
Member

Still interested in this; it would still solve some theoretical problems. But with Azure's updated throttling being rolled out (or perhaps already out) and the improvements we made over the last two years, it doesn't seem as urgent as it once did.

@theunrepentantgeek
Member Author

Following up on this:

The limit is 1200 PUTs/hr per HTTP connection

ARM throughput limits have been increased significantly in 2024:

Throttling limits have increased by roughly 30 times for writes, 2.4 times for deletes, and 7.5 times for reads.

-- See Modernizing Azure Resource Manager Throttling

@kchudnovskiy

kchudnovskiy commented Oct 1, 2024

The updated limits are higher in general, but the way throttling is applied has also changed. For writes, the bucket size is 200 tokens per subscription (and the same per tenant), with the bucket refilled every second; the documented example suggests a refill rate of 10 tokens per second.

For example, with a write bucket size of 200 tokens and a refill rate of 10 tokens per second, even after depleting the bucket with 200 API requests in a second, you can continue making 10 API requests per second.

The way I read it, then, having multiple HTTP connections does not help and possibly makes things worse: a burst of requests just exhausts the initial token pool more quickly.
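To make the arithmetic concrete, here is a simplified token-bucket model using the numbers from the quoted example (bucket of 200, refill of 10 tokens/second; real ARM behavior may differ in details):

```go
package main

import "fmt"

// maxWrites models a token bucket that starts full: over a window of
// `seconds`, you can spend the initial bucket plus everything refilled
// during the window.
func maxWrites(bucket, refillPerSec, seconds int) int {
	return bucket + refillPerSec*seconds
}

func main() {
	// Quoted example: 200-token bucket, 10 tokens/second refill.
	perHour := maxWrites(200, 10, 3600)
	fmt.Println(perHour) // 36200

	// Roughly 30x the old 1200 PUTs/hr/connection limit, consistent with
	// the "roughly 30 times for writes" figure quoted earlier.
	fmt.Println(perHour / 1200) // 30
}
```

Note that the initial 200-token burst contributes almost nothing over an hour; the sustained throughput is dominated by the refill rate, which is why spreading requests across connections no longer buys anything under this model.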
