add AKS resource health to AzureManagedControlPlane #2738
Conversation
Hi @nojnhuh. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/area managedclusters
/ok-to-test
(force-pushed from e3e6d09 to 13cfc23)
It looks like the e2e test is timing out when it checks for the right number of MachinePool replicas to exist, because it takes a while after the AKS instance is created for the Resource Health availability status to turn green. Setting aside that this is probably a bug in Azure, is the test failing in the "correct" place in this scenario? The node pools themselves look fully provisioned and successful in the Azure portal, but the lingering "Unknown" availability status is blocking the AzureManagedMachinePool reconciler from kicking in and recording that. In this case, an error like "machine pool node doesn't exist" isn't accurate, though in the other scenario I outlined in the PR description, where I shut down an underlying VMSS instance, that error seems more reasonable. Is it worth doing some deeper inspection of the actual status to determine how it should be handled? I have a feeling that would make things considerably messier.

And in the short term, how should we compensate for the initial "Unavailable" status in the test? Would it be better to increase the timeout, wait until this is fixed in Azure, or somehow work around it in CAPZ?
@nojnhuh we might need to update our custom
test/e2e/aks.go
Outdated
```diff
@@ -87,6 +87,12 @@ func DiscoverAndWaitForAKSControlPlaneInitialized(ctx context.Context, input Dis
 	})
 	Expect(controlPlane).NotTo(BeNil())

+	Logf("Waiting for 'Available' status from Azure Resource Health for control plane %s/%s", controlPlane.Namespace, controlPlane.Name)
+	WaitForControlPlaneResourceHealthAvailable(ctx, WaitForControlPlaneResourceHealthAvailableInput{
```
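For readers following along, a minimal sketch of what a wait helper like this could look like, assuming the Gomega `Eventually` pattern used by the other helpers in this file (the input fields, timeout, and condition name below are assumptions, not the PR's actual code):

```go
// Sketch only: not this PR's implementation. The input fields, timeout, and
// condition handling are assumptions modeled on the other helpers in this file.
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WaitForControlPlaneResourceHealthAvailableInput is a hypothetical shape for the
// input struct referenced in the diff above.
type WaitForControlPlaneResourceHealthAvailableInput struct {
	Client       client.Client
	ControlPlane conditions.Getter       // e.g. the AzureManagedControlPlane under test
	Condition    clusterv1.ConditionType // e.g. "AzureResourceAvailable"
}

// WaitForControlPlaneResourceHealthAvailable polls until the availability condition is True.
func WaitForControlPlaneResourceHealthAvailable(ctx context.Context, input WaitForControlPlaneResourceHealthAvailableInput) {
	Eventually(func(g Gomega) {
		cp := input.ControlPlane
		g.Expect(input.Client.Get(ctx, client.ObjectKeyFromObject(cp), cp)).To(Succeed())
		g.Expect(conditions.IsTrue(cp, input.Condition)).To(BeTrue())
	}, 30*time.Minute, 10*time.Second).Should(Succeed())
}
```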
Does this considerably slow down the test or is it manageable?
It's considerable. About 15-20 minutes last I tried.
I just repro'd this locally and it took 20 minutes.
The cluster in every other way seems fine. I'm unsure if this is something that will be user-impacting (obviously it's E2E-impacting and this additional timeout tolerance is appropriate).
What capi and/or capz CRDs are impacted by this change?
Until the status turns green and the AzureManagedControlPlane's "Ready" condition turns true, the associated CAPI Cluster and MachinePools and CAPZ AzureManagedMachinePools will also not be marked ready. I'm not sure if any of that would prevent the cluster from actually being usable, though.
Delaying those transitions to Ready by 20 minutes almost certainly has some knock-on effects. How about we turn this into an AzureManagedControlPlane bool configuration property that defaults to false? That way we can expose it to folks who definitely want it and understand the tradeoffs.
I agree, that seems like a good compromise.
```go
// AzureAvailabilityStatusEnabled determines whether or not the AzureResourceAvailable status condition will be reported.
// +optional
AzureAvailabilityStatusEnabled bool `json:"azureAvailabilityStatusEnabled"`
```
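To make the shape of that proposal concrete, here is a rough sketch of how such a spec field could gate the reconcile; the condition constant, interface, and helper below are hypothetical, not code from the PR:

```go
// Sketch only: how an opt-in spec field could gate Resource Health reporting.
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// azureResourceAvailableCondition is illustrative; the PR defines its own constant.
const azureResourceAvailableCondition clusterv1.ConditionType = "AzureResourceAvailable"

// controlPlane is any object that carries conditions and the proposed spec field.
type controlPlane interface {
	conditions.Setter
	AvailabilityStatusEnabled() bool // hypothetical accessor for Spec.AzureAvailabilityStatusEnabled
}

// reconcileResourceHealth skips Resource Health entirely when the toggle is off,
// so the condition can never block the control plane's Ready summary.
func reconcileResourceHealth(ctx context.Context, cp controlPlane, reconcile func(context.Context) error) error {
	if !cp.AvailabilityStatusEnabled() {
		// Drop any previously reported condition so stale data doesn't linger.
		conditions.Delete(cp, azureResourceAvailableCondition)
		return nil
	}
	return reconcile(ctx)
}
```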
+1 to the suggestion to add a toggle so this feature can be enabled, for now, by users who understand it, until the latency is fixed.
However, we need to be very careful with adding API fields to implement temporary toggles like this: once it's in the API spec, we can't remove it without making a breaking change which requires a new API version. How about instead using a feature flag at the controller level, similar to the AKS feature flag itself?
The only downside there is that you would enable it for all your clusters at once, not per cluster. I don't know if that's really a downside though, since I can't really see a use case for someone wanting to enable resource health for some clusters but not others.
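For reference, a sketch of what the controller-level alternative might look like, modeled on how feature gates are typically declared with `k8s.io/component-base/featuregate` (the gate name and defaults are placeholders):

```go
// Sketch only: a feature gate declared the same way as the existing CAPZ gates.
package feature

import (
	"k8s.io/component-base/featuregate"
)

const (
	// AKSResourceHealth is a placeholder gate name for the Resource Health condition.
	AKSResourceHealth featuregate.Feature = "AKSResourceHealth"
)

// defaultGates would be merged into the provider's feature gate registry; off by
// default here, flipped on per-deployment via the manager's --feature-gates flag.
var defaultGates = map[featuregate.Feature]featuregate.FeatureSpec{
	AKSResourceHealth: {Default: false, PreRelease: featuregate.Alpha},
}
```

The controller would then consult the gate once per process (in CAPZ's existing pattern, something like `feature.Gates.Enabled(...)`) rather than reading a per-cluster API field.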
Let's bring this up in office hours this week and solicit feedback from the user community about how to implement this. I can definitely see a reason why a user would prefer their own "resource health" instrumentation (perhaps there are other vectors for latency given how much metrics tooling has the opportunity to touch every network packet). And so this would be a case where including it in the API does make sense (it would be a durable configuration interface).
But let's ask and see what users think.
> I can definitely see a reason why a user would prefer their own "resource health" instrumentation

I don't fully understand this comment; why would one prevent you from doing the other? The `status` field is just extra information. Once AKS fixes the initial 10-minute API response of `Unknown`, this would be useful information for any user, regardless of network stack or metrics tooling, as far as I can tell.

IMO a feature flag would be a better fit in this case, at least to start, and then we can decide to move it into the API later if we see a need. Doing the reverse would be much harder and breaking. If we do want to move it into a durable, user-exposed API Spec field, we also want to do that with a durable API design, potentially adding a new struct so we don't later end up with a flat list of toggles in the CRD `Spec`.
The reason to bring it into the API is to allow users to disable it. If we don't want this to be an opt-in/opt-out feature long-term, then a feature flag while we investigate the latency makes sense.
@mtougeron @zmalik @sonasingh46 @luthermonson Any thoughts on the above? The tl;dr so far is:
- Azure has a generic "Resource Health" service that AKS hooks into to report certain cluster issues via the Azure API.
- The goal of this PR is to report this same info in the `status` of the AzureManagedControlPlane.
- There's a quirk in Azure or AKS where a harmless (as far as we know) error gets reported for the first ~20 minutes after a cluster is fully provisioned, blocking the `Ready` condition on the AzureManagedControlPlane for that period of time. When we mentioned the delay issue to AKS, they thought 20 minutes is definitely worth looking into improving, but I got the sense that it would still likely end up being some single-digit number of minutes at best.
So the questions we're looking to answer are:
- Is this feature something we need to have a way to disable?
- If so, how should it be disabled? The two options presented so far are via the API, in something like the AzureManagedControlPlane's `spec.azureAvailabilityStatusEnabled`, or via a feature flag.
+1 to the feature flag.
I suppose the right thing to do is to turn this off by default. @CecileRobertMichon do we have an existing pattern for introducing a feature flag that's enabled by default?
We don't have an existing one that's on by default, but the way you'd do that is to set the environment variable's default value to true in manager.yaml: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/config/manager/manager.yaml#L26
Sounds good, I'll plan to add the feature flag back. I think that would also give us a path toward moving the toggle into the API if we find it should be a more permanent option.
Is it still worth explicitly ignoring the health metrics error we see for every cluster created, even when the feature is enabled?
My vote is that we launch this feature with the health metrics foo that is aware of failures during the initial n minutes of bootstrapping (a known side effect of attaching an AKS cluster to the Resource Health service), and thus ignores those errors until that reconciliation is observed (or until a reasonably high timeout, say 60 minutes). As long as we document this, I think folks can make an informed decision about whether or not to enable the feature flag.
Discarding that foo, IMO, degrades the cluster experience overall and will discourage feature adoption. I think the tradeoff of discarding these known events for a limited period of time outweighs the additional communication latency of other, non-Resource Health-related health metrics failures that may occur during that same window, while we're waiting for the all-clear to re-enable all metrics errors.
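For concreteness, a sketch of what that grace-period handling could look like (the helper name, the grace period source, and the reason string are assumptions, not code from this PR):

```go
// Sketch only: one way to treat early, known-noisy statuses as benign. The grace
// period, reason string, and how the provisioning time is obtained are assumptions.
package resourcehealth

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// bootstrapGracePeriod matches the "reasonably high timeout" suggested above.
const bootstrapGracePeriod = 60 * time.Minute

// filterInitialUnavailability downgrades a not-yet-Available condition to a benign
// Unknown/Info condition while the control plane is younger than the grace period,
// so the Ready summary isn't blocked by the expected post-provisioning noise.
func filterInitialUnavailability(cond *clusterv1.Condition, provisioned metav1.Time) *clusterv1.Condition {
	if cond == nil || cond.Status == corev1.ConditionTrue {
		return cond // already Available (or nothing reported); nothing to filter
	}
	if time.Since(provisioned.Time) < bootstrapGracePeriod {
		return &clusterv1.Condition{
			Type:     cond.Type,
			Status:   corev1.ConditionUnknown,
			Severity: clusterv1.ConditionSeverityInfo,
			Reason:   "WaitingForResourceHealth", // hypothetical reason
			Message:  "ignoring initial availability errors while the cluster finishes bootstrapping",
		}
	}
	return cond
}
```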
+1
/retest

1 similar comment

/retest

/retest
azure/converters/resourcehealth.go
Outdated
```go
)

// SDKAvailabilityStatusToCondition converts an Azure Resource Health availability status to a status condition.
func SDKAvailabilityStatusToCondition(avail resourcehealth.AvailabilityStatus) *clusterv1.Condition {
```
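For illustration, one possible body for this converter, purely as a sketch (the SDK property names, import path, and the condition/reason choices are assumptions to be checked against the vendored SDK):

```go
package converters

import (
	"github.com/Azure/azure-sdk-for-go/services/resourcehealth/mgmt/2020-05-01/resourcehealth" // API version assumed

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// azureResourceAvailableCondition is illustrative; the PR defines its own constant.
const azureResourceAvailableCondition clusterv1.ConditionType = "AzureResourceAvailable"

// SDKAvailabilityStatusToCondition converts an Azure Resource Health availability status to a status condition.
func SDKAvailabilityStatusToCondition(avail resourcehealth.AvailabilityStatus) *clusterv1.Condition {
	if avail.Properties == nil {
		return conditions.UnknownCondition(azureResourceAvailableCondition, "Unknown", "no availability status returned")
	}
	// Compare on the raw string to avoid depending on the SDK's enum constant names.
	if string(avail.Properties.AvailabilityState) == "Available" {
		return conditions.TrueCondition(azureResourceAvailableCondition)
	}
	summary := ""
	if avail.Properties.Summary != nil {
		summary = *avail.Properties.Summary
	}
	// Anything else (Unavailable, Degraded, Unknown) surfaces the Azure-provided summary.
	return conditions.FalseCondition(azureResourceAvailableCondition,
		string(avail.Properties.AvailabilityState),
		clusterv1.ConditionSeverityError,
		"%s", summary)
}
```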
This is pretty subjective, but I think a name like `status` might be a bit more descriptive of the variable than `avail`.
I think I was trying to distinguish this from the `status` field on the resources, but something like `availStatus` might be a bit clearer while still keeping that distinction.
```go
	defer done()

	resource := s.Scope.AvailabilityStatusResourceURI()
	avail, err := s.GetByResource(ctx, resource)
```
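A plausible continuation of this excerpt, purely as a sketch (the error wrapping and the scope setter are hypothetical):

```go
	// Sketch of a plausible continuation (hypothetical names), assuming the
	// github.com/pkg/errors wrapper used elsewhere in the codebase.
	if err != nil {
		return errors.Wrap(err, "failed to get availability status for resource")
	}
	// Convert the SDK status into a CAPI condition and record it on the scope,
	// where the controller later copies it onto the AzureManagedControlPlane.
	cond := converters.SDKAvailabilityStatusToCondition(avail)
	s.Scope.SetAvailabilityStatusCondition(*cond) // hypothetical setter on the scope
	return nil
```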
Same comment as above regarding the variable naming.
/retest

/retest

1 similar comment

/retest
/assign @jackfrancis

/assign

overall lgtm, left a couple comments
Due to the added time that this feature requires (and our quickly growing list of E2E test scenarios), I opened this in test-infra to bump the test timeout to 4h. (I think I'm bumping it? Maybe the default is higher; in any event, for tests that we expect to run for a while, I think it's more appropriate to declare an explicit, long timeout.)
/lgtm
Thanks for all the great work on this feature!
That seems quite long for a required PR test. Right now it's optional and only runs for AKS exp changes, but as we migrate the feature out of experimental and make this test required on every PR, we may want to think about how we can run a "lean" subset of the test in PRs and keep the lengthier (e.g. upgrade) tests as periodic jobs plus optional PR jobs that only trigger when the AKS codepaths get touched.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@nojnhuh: The following test failed, say `/retest` to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
What type of PR is this?
/kind feature
What this PR does / why we need it: This PR adds support for Azure's Resource Health service and wires it up to mirror the current availability status for an AKS cluster in a condition of the `status` of its corresponding AzureManagedControlPlane.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Part of #2673
Special notes for your reviewer:
When testing this, I found that there's consistently an "Unknown" status immediately after the AKS cluster is fully provisioned, citing an error connecting to the cluster's metrics server. This status changes to the expected "Available" after about 10 minutes.

The other way I found I could generate a non-Available status is to delete an underlying VMSS instance of a machine pool. After about 30 minutes, that status from the API starts being reflected:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: