
Reconcile should perform a diff with Azure #2811

Open
theunrepentantgeek opened this issue Mar 27, 2023 · 14 comments

Comments

@theunrepentantgeek
Member

theunrepentantgeek commented Mar 27, 2023

At the moment we are doing a unilateral PUT of each resource when we reconcile; this works but has some drawbacks.

We should diff the current state of the resource with the spec and only do a PUT where required (see #2600 for potential design).

Follow-up to #1491 as much of that has been implemented already.

@theunrepentantgeek
Member Author

Copied from earlier comment:

Some examples of where this is done in other operators:

CAPZ
Terraform

@matthchr
Member

Another thing I recently realized: Diffing will cause issues with Azure Policy, see: kubernetes-sigs/cluster-api-provider-azure#3009 (comment)

@matthchr
Member

matthchr commented Mar 29, 2023

Another idea that just occurred to me, documenting it here in case it's actually a good one (no guarantees): we could use a heuristic that reduces PUT load without requiring the full investment, and the problems, that come with client-side diffing.

A good example of such a heuristic would be:

  1. After the PUT, take the shape of the returned Azure resource (the Status), hash it and store it in an annotation
  2. Before any subsequent PUT, issue a GET and compare the hash vs the hash we have.

We could have manually crafted customizations per resource, or possibly automatically generate customizations from the Swagger, to ignore fields like read-only timestamps that are constantly changing and would always cause the payloads to mismatch. That said, I don't think there are many constantly changing fields in the APIs I have looked at.
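A minimal Go sketch of steps 1 and 2 above, assuming hypothetical field names and an illustrative annotation scheme (this is not ASO's actual implementation): strip the volatile fields, hash the canonical JSON, and compare against the stored hash before issuing a PUT.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// hashStatus canonicalizes an Azure resource status and returns a hex
// SHA-256 digest, skipping fields (e.g. constantly-changing timestamps)
// that would otherwise always cause a mismatch.
func hashStatus(status map[string]any, ignore map[string]bool) (string, error) {
	filtered := make(map[string]any, len(status))
	for k, v := range status {
		if !ignore[k] {
			filtered[k] = v
		}
	}
	// json.Marshal sorts map keys, so the encoding is deterministic.
	b, err := json.Marshal(filtered)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	ignore := map[string]bool{"lastModifiedAt": true} // illustrative volatile field

	before := map[string]any{"sku": "Standard", "lastModifiedAt": "2023-03-29T01:00:00Z"}
	after := map[string]any{"sku": "Standard", "lastModifiedAt": "2023-03-29T02:00:00Z"}

	h1, _ := hashStatus(before, ignore)
	h2, _ := hashStatus(after, ignore)
	// Only the ignored timestamp changed, so the hashes match and the
	// operator could skip the PUT.
	fmt.Println(h1 == h2) // true
}
```

The per-resource customizations described above would populate the `ignore` set, either hand-written or generated from the Swagger.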

This approach has a lot of advantages:

  1. It works with Azure policy.
  2. It's (relatively) easy to implement.
  3. It doesn't require perfect upstream Swagger specifications that correctly annotate every readOnly field.

If we like this general idea but don't love the hash (it's hard to show exactly what differed with a hash), we could instead store the raw status JSON, compressed or uncompressed, in an annotation. (I say compressed because sticking the entire status into an annotation without shrinking it somehow might cause problems.)
I do think a hash would work, though: we could log the input value, so we don't need a diff constructible from the annotation; users (or we) can recover it manually from the logs.

@matthchr
Member

I believe that the above is actually what Terraform does: from what I can tell, they issue a GET after a PUT and then use the result of that GET as the state file.

They only (manually) extract certain fields from the GET response: presumably fields that they know are writable, thus ignoring all of the readonly fields.
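That Terraform-style extraction can be sketched as follows (the field names here are hypothetical; real Azure responses vary per resource type): keep only an allowlist of known-writable fields from the GET response and discard everything read-only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractWritable keeps only the fields known to be writable, dropping
// read-only fields (ids, timestamps, provisioning state) from a GET response.
func extractWritable(resp map[string]any, writable []string) map[string]any {
	out := make(map[string]any)
	for _, k := range writable {
		if v, ok := resp[k]; ok {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Hypothetical GET response for some Azure resource.
	resp := map[string]any{
		"location":          "westus2",
		"sku":               "Standard",
		"provisioningState": "Succeeded",            // read-only
		"createdAt":         "2023-03-29T01:00:00Z", // read-only
	}
	state := extractWritable(resp, []string{"location", "sku"})
	b, _ := json.Marshal(state)
	fmt.Println(string(b)) // {"location":"westus2","sku":"Standard"}
}
```

The cost of this approach is the manual allowlist per resource, which is exactly the kind of per-resource customization discussed earlier in this thread.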

@matthchr matthchr added this to the v2.1.0 milestone Apr 3, 2023
@theunrepentantgeek
Member Author

We're still keen on doing this.

@petrk-75

petrk-75 commented Aug 10, 2023

Is there any estimate for when ASO will stop using PUT for reconciliation and use the diff described above instead? Is there any guarantee it will make it into 2.4.0? Any rough schedule?

@kmitawojciech

We're still keen on doing this.

Any specific plans when we could expect this feature?

@matthchr
Member

We're still keen on doing this.

Any specific plans when we could expect this feature?

It's not currently that high on the list, mostly because it's quite difficult to implement generically. Can you expand on why you want/need it? AFAIK there are two key reasons it's interesting over just doing PUT:

  1. To reduce PUT load on the subscription and avoid throttling. We've already made improvements to how often we PUT, though, and since then I haven't heard of many users running into throttling concerns.
  2. To avoid messing with resources whose PUT isn't idempotent, unless there is a change to be made. AFAIK very few resources fall into this category.

Does your desire fall into one of the above? Or is there another category of reason we're missing?

@petrk-75

petrk-75 commented Aug 10, 2023

We are working on a solution that needs to be scalable. As soon as users/tenants create new services, we need to create new resources. The number of expected resources could be higher than the current ASO throughput, and a limit of 300/1200 resources per subscription may not be sufficient. So there's neither an estimate nor a guarantee it will be part of 2.4.0; do I understand that right?

@matthchr
Member

Ah, yes, our FAQ is actually slightly out of date here. It says:

Azure subscriptions have a default limit of ~1200 PUTs

This is not strictly true. The limit is 1200 PUTs/hr per HTTP connection (technically per frontend instance, but HTTP connections are pinned to an instance, so it basically boils down to that). We now configure an HTTP client that uses multiple connections (see #2685), so the limit is higher than it used to be.

After we made the above changes, we haven't seen users actually hit throttling in practice. That doesn't mean there isn't a limit (there is), but it's probably an order of magnitude higher than 1200/hr. There are also limits on GETs, so there will always be a maximum number of resources you can manage in a single subscription.

Don't necessarily let that FAQ entry scare you away; I've made a note to update it. In reality, between the improvements we've made and the ability to tune the azureSyncInterval, you should have all the tools you need to scale to a large number of resources today, even without this diffing logic.

That's not to say we aren't going to do this diffing stuff, but the approach we have now is actually pretty good from a "number of resources" perspective and so we're waiting for more signal to bump the importance of this up.

@matthchr
Member

matthchr commented Dec 4, 2023

This is still something we're interested in, although given our throttling changes it doesn't seem as critical.

@theunrepentantgeek theunrepentantgeek modified the milestones: v2.6.0, v2.7.0 Dec 11, 2023
@matthchr matthchr removed this from the v2.7.0 milestone Feb 22, 2024
@matthchr
Member

Still interested in this; it would still solve some theoretical problems. But with Azure's updated throttling being rolled out (or perhaps already out) and the improvements we made over the last two years, it doesn't seem as urgent as it once did.

@theunrepentantgeek
Member Author

Following up on this:

The limit is 1200 PUTs/hr per HTTP connection

ARM throughput limits have been increased significantly in 2024:

Throttling limits have increased by roughly 30 times for writes, 2.4 times for deletes, and 7.5 times for reads.

-- See Modernizing Azure Resource Manager Throttling

@kchudnovskiy

kchudnovskiy commented Oct 1, 2024

The updated limits are higher in general, but the way throttling is applied has also changed. For writes, the bucket size is 200 tokens per subscription (and the same per tenant), with the bucket refilled every second; the documented example suggests a refill rate of 10 tokens per second.

For example, with a write bucket size of 200 tokens and a refill rate of 10 tokens per second, even after depleting the bucket with 200 API requests in a second, you can continue making 10 API requests per second.

The way I read it, then, having multiple HTTP connections does not help and possibly makes things worse: a burst of requests just exhausts the initial token pool more quickly.
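To make the arithmetic concrete, here is a simplified token-bucket model using the numbers from the quoted example (bucket of 200, refill of 10 tokens/second; real ARM behavior may differ in details):

```go
package main

import "fmt"

// maxWrites models a token bucket that starts full: over a window of
// `seconds`, you can spend the initial bucket plus everything refilled
// during the window.
func maxWrites(bucket, refillPerSec, seconds int) int {
	return bucket + refillPerSec*seconds
}

func main() {
	// Quoted example: 200-token bucket, 10 tokens/second refill.
	perHour := maxWrites(200, 10, 3600)
	fmt.Println(perHour) // 36200

	// Roughly 30x the old 1200 PUTs/hr/connection limit, consistent with
	// the "roughly 30 times for writes" figure quoted earlier.
	fmt.Println(perHour / 1200) // 30
}
```

Note that the initial 200-token burst contributes almost nothing over an hour; the sustained throughput is dominated by the refill rate, which is why spreading requests across connections no longer buys anything under this model.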
