add AKS resource health to AzureManagedControlPlane #2738
Conversation
Hi @nojnhuh. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/area managedclusters
/ok-to-test
(force-pushed from e3e6d09 to 13cfc23)
It looks like the e2e test is timing out when it checks for the right number of MachinePool replicas to exist, because it takes a while after the AKS instance is created for the Resource Health availability status to turn green. Setting aside that this is probably a bug in Azure, is the test failing in the "correct" place in this scenario? The node pools themselves look fully provisioned and successful in the Azure portal, but the lingering "Unknown" availability status is blocking the AzureManagedMachinePool reconciler from kicking in and recording that. In this case, an error like "machine pool node doesn't exist" isn't accurate, though in the other scenario I outlined in the PR description, where I shut down an underlying VMSS instance, that error seems more reasonable. Is it worth doing some deeper inspection of the actual status to determine how it should be handled? I have a feeling that would make things considerably messier.

And in the short term, how should we compensate for the initial "Unavailable" status in the test? Would it be better to increase the timeout, wait until this is fixed in Azure, or somehow work around it in CAPZ?
@nojnhuh we might need to update our custom
test/e2e/aks.go
Outdated
```diff
@@ -87,6 +87,12 @@ func DiscoverAndWaitForAKSControlPlaneInitialized(ctx context.Context, input Dis
 	})
 	Expect(controlPlane).NotTo(BeNil())

+	Logf("Waiting for 'Available' status from Azure Resource Health for control plane %s/%s", controlPlane.Namespace, controlPlane.Name)
+	WaitForControlPlaneResourceHealthAvailable(ctx, WaitForControlPlaneResourceHealthAvailableInput{
```
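For readers following along, a minimal sketch of what a wait helper like this could look like, assuming the Gomega `Eventually` pattern used by the other helpers in this file (the input fields, timeout, and condition name below are assumptions, not the PR's actual code):

```go
// Sketch only: not this PR's implementation. The input fields, timeout, and
// condition handling are assumptions modeled on the other helpers in this file.
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WaitForControlPlaneResourceHealthAvailableInput is a hypothetical shape for the
// input struct referenced in the diff above.
type WaitForControlPlaneResourceHealthAvailableInput struct {
	Client       client.Client
	ControlPlane conditions.Getter       // e.g. the AzureManagedControlPlane under test
	Condition    clusterv1.ConditionType // e.g. "AzureResourceAvailable"
}

// WaitForControlPlaneResourceHealthAvailable polls until the availability condition is True.
func WaitForControlPlaneResourceHealthAvailable(ctx context.Context, input WaitForControlPlaneResourceHealthAvailableInput) {
	Eventually(func(g Gomega) {
		cp := input.ControlPlane
		g.Expect(input.Client.Get(ctx, client.ObjectKeyFromObject(cp), cp)).To(Succeed())
		g.Expect(conditions.IsTrue(cp, input.Condition)).To(BeTrue())
	}, 30*time.Minute, 10*time.Second).Should(Succeed())
}
```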
Does this considerably slow down the test or is it manageable?
It's considerable. About 15-20 minutes last I tried.
I just repro'd this locally and it took 20 minutes.
The cluster in every other way seems fine. I'm unsure if this is something that will be user-impacting (obviously it's E2E-impacting and this additional timeout tolerance is appropriate).
What capi and/or capz CRDs are impacted by this change?
Until the status turns green and the AzureManagedControlPlane's "Ready" condition turns true, the associated CAPI Cluster and MachinePools and CAPZ AzureManagedMachinePools will also not be marked ready. I'm not sure if any of that would prevent the cluster from actually being usable, though.
Delaying those transitions to Ready by 20 minutes almost certainly has some knock-on effects. How about we turn this into an AzureManagedControlPlane bool configuration property that defaults to false? That way we can expose it to folks who definitely want it and understand the tradeoffs.
I agree, that seems like a good compromise.
```go
// AzureAvailabilityStatusEnabled determines whether or not the AzureResourceAvailable status condition will be reported.
// +optional
AzureAvailabilityStatusEnabled bool `json:"azureAvailabilityStatusEnabled"`
```
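To make the shape of that proposal concrete, here is a rough sketch of how such a spec field could gate the reconcile; the condition constant, interface, and helper below are hypothetical, not code from the PR:

```go
// Sketch only: how an opt-in spec field could gate Resource Health reporting.
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// azureResourceAvailableCondition is illustrative; the PR defines its own constant.
const azureResourceAvailableCondition clusterv1.ConditionType = "AzureResourceAvailable"

// controlPlane is any object that carries conditions and the proposed spec field.
type controlPlane interface {
	conditions.Setter
	AvailabilityStatusEnabled() bool // hypothetical accessor for Spec.AzureAvailabilityStatusEnabled
}

// reconcileResourceHealth skips Resource Health entirely when the toggle is off,
// so the condition can never block the control plane's Ready summary.
func reconcileResourceHealth(ctx context.Context, cp controlPlane, reconcile func(context.Context) error) error {
	if !cp.AvailabilityStatusEnabled() {
		// Drop any previously reported condition so stale data doesn't linger.
		conditions.Delete(cp, azureResourceAvailableCondition)
		return nil
	}
	return reconcile(ctx)
}
```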
+1 to the suggestion to add a toggle so this feature can be enabled, for now, by users who understand it, until the latency is fixed.
However, we need to be very careful with adding API fields to implement temporary toggles like this: once it's in the API spec, we can't remove it without making a breaking change which requires a new API version. How about instead using a feature flag at the controller level, similar to the AKS feature flag itself?
The only downside there is that you would enable it for all your clusters at once, not per cluster. I don't know if that's really a downside though, since I can't really see a use case for someone wanting to enable resource health for some clusters but not others.
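For reference, a sketch of what the controller-level alternative might look like, modeled on how feature gates are typically declared with `k8s.io/component-base/featuregate` (the gate name and defaults are placeholders):

```go
// Sketch only: a feature gate declared the same way as the existing CAPZ gates.
package feature

import (
	"k8s.io/component-base/featuregate"
)

const (
	// AKSResourceHealth is a placeholder gate name for the Resource Health condition.
	AKSResourceHealth featuregate.Feature = "AKSResourceHealth"
)

// defaultGates would be merged into the provider's feature gate registry; off by
// default here, flipped on per-deployment via the manager's --feature-gates flag.
var defaultGates = map[featuregate.Feature]featuregate.FeatureSpec{
	AKSResourceHealth: {Default: false, PreRelease: featuregate.Alpha},
}
```

The controller would then consult the gate once per process (in CAPZ's existing pattern, something like `feature.Gates.Enabled(...)`) rather than reading a per-cluster API field.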
Let's bring this up in office hours this week and solicit feedback from the user community about how to implement this. I can definitely see a reason why a user would prefer their own "resource health" instrumentation (perhaps there are other vectors for latency given how much metrics tooling has the opportunity to touch every network packet). And so this would be a case where including it in the API does make sense (it would be a durable configuration interface).
But let's ask and see what users think.
> I can definitely see a reason why a user would prefer their own "resource health" instrumentation

I don't fully understand this comment; why would one prevent you from doing the other? The `status` field is just extra information. Once AKS fixes the initial 10-minute API response of `Unknown`, this would be useful information for any user, regardless of network stack or metrics tooling, as far as I can tell.

IMO a feature flag would be a better fit in this case, at least to start, and then we can decide to move it into the API later if we see a need. Doing the reverse would be much harder and breaking. If we do want to move it into a durable, user-exposed API Spec field, we also want to do that with a durable API design, potentially adding a new struct so we don't later end up with a flat list of toggles in the CRD `Spec`.
The reason to bring it into the API is to allow users to disable it. If we don't want this to be an opt-in/opt-out feature long-term, then a feature flag while we investigate the latency makes sense.
@mtougeron @zmalik @sonasingh46 @luthermonson Any thoughts on the above? The tl;dr so far is:
- Azure has a generic "Resource Health" service that AKS hooks into to report certain cluster issues via the Azure API.
- The goal of this PR is to report this same info in the `status` of the AzureManagedControlPlane.
- There's a quirk in Azure or AKS where a harmless (as far as we know) error gets reported for the first ~20 minutes after a cluster is fully provisioned, blocking the `Ready` condition on the AzureManagedControlPlane for that period of time. When we mentioned the delay issue to AKS, they thought 20 minutes is definitely worth looking into improving, but I got the sense that it would still likely end up being some single-digit number of minutes at best.
So the questions we're looking to answer are:
- Is this feature something we need to have a way to disable?
- If so, how should it be disabled? The two options presented so far are via the API, in something like the AzureManagedControlPlane's `spec.azureAvailabilityStatusEnabled`, or via a feature flag.
+1 to the feature flag.
I suppose the right thing to do is to turn this off by default. @CecileRobertMichon do we have an existing pattern for introducing a feature flag that's enabled by default?
We don't have an existing one that's on by default, but the way you'd do that is to set the environment variable's default value to true in manager.yaml: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/config/manager/manager.yaml#L26
Sounds good, I'll plan to add the feature flag back. I think that would also give us a path toward moving the toggle into the API if we find it should be a more permanent option.
Is it still worth explicitly ignoring the health metrics error we see for every cluster created, even when the feature is enabled?
My vote is that we launch this feature with the health metrics foo that is aware of failures during the initial n minutes of bootstrapping (a known side effect of attaching an AKS cluster to the Resource Health service), and thus ignores those errors until that reconciliation is observed (or until a reasonably high timeout, say 60 minutes). As long as we document this, I think folks can make an informed decision about whether or not to enable the feature flag.
Discarding that foo, IMO, degrades the cluster experience overall and will discourage feature adoption. I think the tradeoff of discarding these known events for a limited period of time outweighs the additional communication latency of other, non-Resource Health-related health metrics failures that may occur during that same window, while we're waiting for the all-clear to re-enable all metrics errors.
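For concreteness, a sketch of what that grace-period handling could look like (the helper name, the grace period source, and the reason string are assumptions, not code from this PR):

```go
// Sketch only: one way to treat early, known-noisy statuses as benign. The grace
// period, reason string, and how the provisioning time is obtained are assumptions.
package resourcehealth

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// bootstrapGracePeriod matches the "reasonably high timeout" suggested above.
const bootstrapGracePeriod = 60 * time.Minute

// filterInitialUnavailability downgrades a not-yet-Available condition to a benign
// Unknown/Info condition while the control plane is younger than the grace period,
// so the Ready summary isn't blocked by the expected post-provisioning noise.
func filterInitialUnavailability(cond *clusterv1.Condition, provisioned metav1.Time) *clusterv1.Condition {
	if cond == nil || cond.Status == corev1.ConditionTrue {
		return cond // already Available (or nothing reported); nothing to filter
	}
	if time.Since(provisioned.Time) < bootstrapGracePeriod {
		return &clusterv1.Condition{
			Type:     cond.Type,
			Status:   corev1.ConditionUnknown,
			Severity: clusterv1.ConditionSeverityInfo,
			Reason:   "WaitingForResourceHealth", // hypothetical reason
			Message:  "ignoring initial availability errors while the cluster finishes bootstrapping",
		}
	}
	return cond
}
```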
+1
/retest

1 similar comment

/retest

/retest
azure/converters/resourcehealth.go
Outdated
```go
)

// SDKAvailabilityStatusToCondition converts an Azure Resource Health availability status to a status condition.
func SDKAvailabilityStatusToCondition(avail resourcehealth.AvailabilityStatus) *clusterv1.Condition {
```
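For illustration, one possible body for this converter, purely as a sketch (the SDK property names, import path, and the condition/reason choices are assumptions to be checked against the vendored SDK):

```go
package converters

import (
	"github.com/Azure/azure-sdk-for-go/services/resourcehealth/mgmt/2020-05-01/resourcehealth" // API version assumed

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// azureResourceAvailableCondition is illustrative; the PR defines its own constant.
const azureResourceAvailableCondition clusterv1.ConditionType = "AzureResourceAvailable"

// SDKAvailabilityStatusToCondition converts an Azure Resource Health availability status to a status condition.
func SDKAvailabilityStatusToCondition(avail resourcehealth.AvailabilityStatus) *clusterv1.Condition {
	if avail.Properties == nil {
		return conditions.UnknownCondition(azureResourceAvailableCondition, "Unknown", "no availability status returned")
	}
	// Compare on the raw string to avoid depending on the SDK's enum constant names.
	if string(avail.Properties.AvailabilityState) == "Available" {
		return conditions.TrueCondition(azureResourceAvailableCondition)
	}
	summary := ""
	if avail.Properties.Summary != nil {
		summary = *avail.Properties.Summary
	}
	// Anything else (Unavailable, Degraded, Unknown) surfaces the Azure-provided summary.
	return conditions.FalseCondition(azureResourceAvailableCondition,
		string(avail.Properties.AvailabilityState),
		clusterv1.ConditionSeverityError,
		"%s", summary)
}
```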
This is pretty subjective, but I think a name like `status` might be a bit more descriptive of the variable than `avail`.
I think I was trying to distinguish this from the `status` field on the resources, but something like `availStatus` might be a bit clearer while still keeping that distinction.
```go
	defer done()

	resource := s.Scope.AvailabilityStatusResourceURI()
	avail, err := s.GetByResource(ctx, resource)
```
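A plausible continuation of this excerpt, purely as a sketch (the error wrapping and the scope setter are hypothetical):

```go
	// Sketch of a plausible continuation (hypothetical names), assuming the
	// github.com/pkg/errors wrapper used elsewhere in the codebase.
	if err != nil {
		return errors.Wrap(err, "failed to get availability status for resource")
	}
	// Convert the SDK status into a CAPI condition and record it on the scope,
	// where the controller later copies it onto the AzureManagedControlPlane.
	cond := converters.SDKAvailabilityStatusToCondition(avail)
	s.Scope.SetAvailabilityStatusCondition(*cond) // hypothetical setter on the scope
	return nil
```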
Same comment as above regarding the variable naming.
/retest

/retest

1 similar comment

/retest
/assign @jackfrancis

/assign

overall lgtm, left a couple comments
Due to the added time that this feature requires (and our quickly growing list of E2E test scenarios), I opened this in test-infra to bump the test timeout to 4h. (I think I'm bumping it? Maybe the default is higher; in any event, for tests that we expect to run for a while, I think it's more appropriate to declare an explicit, long timeout.)
/lgtm
Thanks for all the great work on this feature!
That seems quite long for a required PR test. Right now it's optional and only runs for AKS exp changes, but as we migrate the feature out of experimental and make this test required on every PR, we may want to think about how we can run a "lean" subset of the test in PRs and keep the lengthier (e.g. upgrade) tests as periodic jobs plus optional PR jobs that only trigger when the AKS codepaths get touched.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@nojnhuh: The following test failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
What type of PR is this?
/kind feature
What this PR does / why we need it: This PR adds support for Azure's Resource Health service and wires it up to mirror the current availability status for an AKS cluster in a condition of the `status` of its corresponding AzureManagedControlPlane.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Part of #2673
Special notes for your reviewer:
When testing this, I found that there's consistently an "Unknown" status immediately after the AKS cluster is fully provisioned, citing an error connecting to the cluster's metrics server. This status changes to the expected "Available" after about 10 minutes.

The other way I found I could generate a non-Available status is to delete an underlying VMSS instance of a machine pool. After about 30 minutes, that status from the API starts being reflected:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: