Reconcile should perform a diff with Azure #2811
Copied from earlier comment: Some examples of where this is done in other operators: |
Another thing I recently realized: Diffing will cause issues with Azure Policy, see: kubernetes-sigs/cluster-api-provider-azure#3009 (comment) |
Another idea that just occurred to me, documenting it here in case it's actually a good one (no guarantees): We could have a heuristic that reduces PUT load without requiring the full investment and problems that come with client-side diffing. A good example of such a heuristic would be:
We could have manually crafted customizations per resource, or possibly automatically generate customizations from the Swagger to ignore fields like readonly times that are constantly changing and would always cause the payloads to mismatch. Although I actually don't think there are that many constantly changing fields in the APIs I have looked at. This approach has a lot of advantages:
If we like this general idea but don't love the hash (hard to show exactly what diffed w/ a hash), we could store a compressed or uncompressed raw annotation of the status JSON instead. (I say compressed because sticking the entire status into an annotation without shrinking it somehow might cause problems). |
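To make the hash idea concrete, here is a minimal sketch of one possible reading of it. The annotation key, the set of ignored fields, and the function names are all hypothetical and not part of ASO; the sketch assumes the resource state is available as a JSON-like dict and that the hash of the last known state is stored in an annotation.

```python
import hashlib
import json

# Hypothetical names: neither the annotation key nor the ignored fields come from ASO.
HASH_ANNOTATION = "example.com/last-applied-hash"
VOLATILE_FIELDS = frozenset({"lastModifiedTime", "provisioningState"})


def strip_volatile(value, ignored=VOLATILE_FIELDS):
    """Recursively drop keys that are expected to change on every read."""
    if isinstance(value, dict):
        return {k: strip_volatile(v, ignored) for k, v in value.items() if k not in ignored}
    if isinstance(value, list):
        return [strip_volatile(v, ignored) for v in value]
    return value


def state_hash(state):
    """Hash a canonical JSON encoding so key order does not affect the result."""
    canonical = json.dumps(strip_volatile(state), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_put(annotations, state):
    """Return True when the stored hash is missing or differs from the current state."""
    return annotations.get(HASH_ANNOTATION) != state_hash(state)
```

As noted above, a hash only says that something changed, not what changed; storing the stripped JSON itself (possibly compressed) would trade annotation size for the ability to show the actual difference.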
I believe that the above is actually what Terraform does - from what I can tell they issue a GET after a PUT and then use the result of that GET as the statefile. They only (manually) extract certain fields from the GET response: presumably fields that they know are writable, thus ignoring all of the readonly fields. |
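A rough illustration of the "only look at writable fields" approach, under the assumption that the fields present in the desired spec are exactly the writable ones. This is a sketch of the general technique, not Terraform's or ASO's implementation, and `project` and `has_drift` are hypothetical helper names.

```python
def project(observed, desired):
    """Keep only the parts of the GET response that the desired spec also sets."""
    if isinstance(desired, dict) and isinstance(observed, dict):
        return {k: project(observed[k], v) for k, v in desired.items() if k in observed}
    if isinstance(desired, list) and isinstance(observed, list) and len(desired) == len(observed):
        return [project(o, d) for o, d in zip(observed, desired)]
    # Scalars, type mismatches, and lists of different length are compared as-is.
    return observed


def has_drift(observed, desired):
    """Return True when a PUT would be needed to bring Azure back to the spec."""
    return project(observed, desired) != desired
```

Server-side defaults and readonly fields that appear only in the GET response are ignored by construction, which is the property the comment above attributes to manually extracting fields.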
We're still keen on doing this. |
Is there any estimate when ASO will stop using PUT for reconciliation and will use the DIFF described above instead? Is there some guarantee it will get to 2.4.0? Any rough time schedule? |
Any specific plans when we could expect this feature? |
It's not currently that high on the list, mostly because it's quite difficult to implement generically. Can you expand on why you want/need it? AFAIK there are two key reasons it's interesting over just doing PUT:
Does your desire fall into one of the above? Or is there another category of reason we're missing? |
We are working on a solution which needs to be scalable. As soon as users/tenants create new services, we need to create new resources. The number of expected resources could be higher than the current ASO throughput. Having a limit of 300/1200 resources per subscription may not be sufficient. So there's neither an estimate nor a guarantee it will be part of 2.4.0, do I understand it right? |
Ah, yes our FAQ is actually slightly out of date here: It says,
This is not actually strictly true. The limit is 1200 PUTs/hr per HTTP connection (well per frontend instance but HTTP connections are pinned to an instance so it basically boils down to that). We now have an HTTP client configured that uses multiple connections (see #2685), so the limit is higher than it used to be. After we made the above changes, we haven't seen users actually hitting throttling in practice. That doesn't mean there isn't a limit (there is), but it's higher than 1200/hr by probably something like an order of magnitude. There are also limits on GETs so there's always going to be a maximum number of resources you can manage in a single subscription. Don't necessarily let that FAQ entry scare you away - I've made a note to update it, but in reality between the improvements we've done and the ability to tune the That's not to say we aren't going to do this diffing stuff, but the approach we have now is actually pretty good from a "number of resources" perspective and so we're waiting for more signal to bump the importance of this up. |
This is still something we're interested in, although given our throttling changes it doesn't seem as critical. |
Still interested in this, since it would still solve some theoretical problems. But with Azure's updated throttling being rolled out (or maybe already out?) and the improvements we made last year and the year before, it doesn't seem as urgent as it once did. |
Following up on this:
ARM throughput limits have been increased significantly in 2024:
|
The updated limits are higher in general, but the way throttling is applied has also changed. For writes, the bucket size is 200 with a refill rate of 10 per second for a subscription, and the same per tenant. Refill happens every second; there is an example which suggests the refill rate is 10 per second.
The way I read it then, having multiple HTTP connections does not help and possibly makes things worse. A burst of requests would mean the initial pool is exhausted quicker. |
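The burst behaviour described above can be sketched with a small token bucket simulation. The numbers (bucket size 200, refill of 10 per second) are the ones quoted in this comment; the class and function names are hypothetical, and the model is a simplification, not a description of how ARM actually implements throttling.

```python
class TokenBucket:
    """Simplified token bucket; time is advanced explicitly to keep runs deterministic."""

    def __init__(self, capacity=200, refill_per_second=10):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # the bucket starts full

    def advance(self, seconds):
        """Refill for the elapsed time, never exceeding the bucket capacity."""
        self.tokens = min(self.capacity, self.tokens + seconds * self.refill_per_second)

    def try_acquire(self):
        """Consume one token if available; a False result models a throttled request."""
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def burst(bucket, requests):
    """Send `requests` writes back to back and return how many were accepted."""
    return sum(1 for _ in range(requests) if bucket.try_acquire())
```

In this model a burst of 300 writes gets 200 through and 100 throttled, and one second later only 10 more are accepted, regardless of how many connections issued them. The sustained rate is 10 × 3600 = 36,000 writes per hour per subscription.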
At the moment we are doing a unilateral PUT of each resource when we reconcile; this works but has some drawbacks.
We should diff the current state of the resource with the spec and only do a PUT where required (see #2600 for potential design).
Follow-up to #1491 as much of that has been implemented already.