Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add AKS resource health to AzureManagedControlPlane #2738

Merged
merged 1 commit into from
Dec 16, 2022

Conversation

nojnhuh
Copy link
Contributor

@nojnhuh nojnhuh commented Oct 19, 2022

What type of PR is this?
/kind feature

What this PR does / why we need it: This PR adds Azure's Resource Health service and wires it up to mirror the current availability status for an AKS cluster in a condition of the status of its corresponding AzureManagedControlPlane.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Part of #2673

Special notes for your reviewer:
When testing this, I found that there's consistently an "Unknown" status immediately after the AKS cluster is fully provisioned citing an error connecting to the cluster's metrics server. This status goes to the expected "Available" after about 10 minutes.

The other way I found I could generate a non-Available status is to delete an underlying VMSS instance of a machine pool. After about 30 minutes, this status from the API starts being reflected:

{
  "id": "/subscriptions/[redacted secret capz-manager-bootstrap-credentials:subscription-id]/resourcegroups/jon-12709/providers/microsoft.containerservice/managedclusters/jon-12709/providers/Microsoft.ResourceHealth/availabilityStatuses/current",
  "name": "current",
  "type": "Microsoft.ResourceHealth/AvailabilityStatuses",
  "location": "eastus",
  "properties": {
    "availabilityState": "Degraded",
    "title": "Customer Node Power Off",
    "summary": "A node could not be found in the running/powered on state.",
    "reasonType": "Customer Initiated",
    "rootCauseAttributionTime": "2022-10-19T18:48:21.0367057Z",
    "resolutionETA": "2022-10-19T19:08:21Z",
    "reasonChronicity": "Persistent",
    "reportedTime": "2022-10-19T19:52:31.9720417Z"
  }
}

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

AKS resource health added to AzureManagedControlPlane status

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 19, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @nojnhuh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 19, 2022
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Oct 19, 2022

/area managedclusters

@k8s-ci-robot k8s-ci-robot added the area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type label Oct 19, 2022
@CecileRobertMichon
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 19, 2022
@nojnhuh nojnhuh force-pushed the aks-health branch 2 times, most recently from e3e6d09 to 13cfc23 Compare October 20, 2022 21:47
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Oct 24, 2022

It looks like the e2e test is timing out when it's checking for the right number of replicas of the MachinePool to exist because it takes a while after the AKS instance is created for the Resource Health availability status to turn green. Disregarding that that's probably a bug in Azure, is the test failing in the "correct" place in this scenario?

The node pools themselves in the Azure portal look fully provisioned and in a successful state, but the lingering "Unknown" availability status is blocking the AzureManagedMachinePool reconciler from kicking in and recording that. In this case, an error like "machine pool node doesn't exist" isn't accurate, though in the other scenario I outlined in the PR description where I shut down an underlying VMSS instance, that error seems more reasonable. Is it worth doing some deeper inspection of the actual status to determine how it should be handled? I have a feeling that would make things considerably more messy.

And in the short-term, how should we compensate for the initial "Unavailable" status in the test? Would it be better to increase the timeout, wait until this is fixed in Azure, or somehow work around this in CAPZ?

@jackfrancis
Copy link
Contributor

@nojnhuh we might need to update our custom DiscoverAndWaitForAKSControlPlaneInitialized func that we use to block progress of the E2E tests on "cluster is up and ready!" status

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2022
test/e2e/aks.go Outdated
@@ -87,6 +87,12 @@ func DiscoverAndWaitForAKSControlPlaneInitialized(ctx context.Context, input Dis
})
Expect(controlPlane).NotTo(BeNil())

Logf("Waiting for 'Available' status from Azure Resource Health for control plane %s/%s", controlPlane.Namespace, controlPlane.Name)
WaitForControlPlaneResourceHealthAvailable(ctx, WaitForControlPlaneResourceHealthAvailableInput{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this considerably slow down the test or is it manageable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's considerable. About 15-20 minutes last I tried.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just repro'd this locally and it took 20 minutes.

The cluster in every other way seems fine. I'm unsure if this is something that will be user-impacting (obviously it's E2E-impacting and this additional timeout tolerance is appropriate).

What capi and/or capz CRDs are impacted by this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until the status turns green and the AzureManagedControlPlane's "Ready" condition turns true, the associated CAPI Cluster and MachinePools and CAPZ AzureManagedMachinePools will also not be marked ready. I'm not sure if any of that would prevents the cluster from actually being usable though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delaying those transitions to Ready by 20 minutes almost certainly has some knock-on effects. How about we turn this into a AzureManagedControlPlane bool configuration property that defaults to false? That way we can expose it to folks who definitely want it, and understand the tradeoffs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that seems like a good compromise.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 5, 2022

// AzureAvailabilityStatusEnabled determines whether or not the AzureResourceAvailable status condition will be reported.
// +optional
AzureAvailabilityStatusEnabled bool `json:"azureAvailabilityStatusEnabled"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the suggestion to add a toggle so this feature can be enabled by users who understand it for now until latency is fixed

However, we need to be very careful with adding API fields to implement temporary toggles like this: once it's in the API spec, we can't remove it without making a breaking change which requires a new API version. How about instead using a feature flag at the controller level, similar to the AKS feature flag itself?

The only downside there is that you would enable it for all your clusters at once, not per cluster. I don't know if that's really a downside though, since I can't really see a use case for someone wanting to enable resource health for some clusters but not others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's bring this up in office hours this week and solicit feedback from the user community about how to implement this. I can definitely see a reason why a user would prefer their own "resource health" instrumentation (perhaps there are other vectors for latency given how much metrics tooling has the opportunity to touch every network packet). And so this would be a case where including it in the API does make sense (it would be a durable configuration interface).

But let's ask and see what users think.

cc @mtougeron @zmalik @sonasingh46 @luthermonson

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely see a reason why a user would prefer their own "resource health" instrumentation

I don't fully understand this comment, why would one prevent you from doing the other? The Status field is just extra information, once AKS fixes the initial 10 minute initial API response of Unknown this would be useful information for any user, regardless of network stack or metric tooling as far as I can tell.

IMO a feature flag would be a better fit, in this case, at least to start, and then we can decide to move it into the API later if we see a need. Doing the reverse would be much harder and breaking. If we do want to move it into a durable API user exposed Spec field, we also want to do this with a durable API design, potentially add a new struct so we don't later end up with a flat list of toggles in the CRD Spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to bring it into the API is to allow users to disable it. If we don't want this to be an opt-in/opt-out feature long-term than a feature flag while we investigate latency makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtougeron @zmalik @sonasingh46 @luthermonson Any thoughts on the above? The tl;dr so far is:

  • Azure has a generic "Resource Health" service that AKS hooks into to report certain cluster issues via the Azure API.
  • The goal of this PR is to report this same info in the status of the AzureManagedControlPlane.
  • There's a quirk in Azure or AKS where a harmless (as far as we know) error gets reported for the first ~20 minutes after a cluster is fully provisioned, blocking the Ready condition on the AzureManagedControlPlane for that period of time. When we mentioned the delay issue to AKS, they thought 20 minutes is definitely worth looking into improving, but I got the sense that it would still likely end up being some single-digit number of minutes at best.

So the questions we're looking to answer are:

  • Is this feature something we need to have a way to disable?
  • If so, how should it be disabled? The two options presented so far are via the API in something like the AzureManagedControlPlane's spec.azureAvailabilityStatusEnabled or via a feature flag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the feature flag.

I suppose the right thing to do is to turn this off by default. @CecileRobertMichon do we have an existing pattern where we would introduce a feature flag default-enabled?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have an existing one that on by default but the way you'd do that it to set the environment variable default value to true in manager.yaml https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/config/manager/manager.yaml#L26

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I'll plan to add the feature flag back. I think that would also give us a path toward moving the toggle into the API if we find that should be a more permanent option.

Is it worth still explicitly ignoring this health metrics error we see for every cluster created even when the feature is enabled?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vote is that we launch this feature with the health metrics foo that is aware of failures during the initial n minutes of bootstrapping (a known side-effect of attaching an AKS cluster to Resource Health service), and thus ignores those errors until that reconciliation is observed (or a reasonably high timeout, 60 mins). As long as we document this then I think folks can make an informed decision about whether or not to enable the feature flag.

Discarding that foo, IMO, degrades the cluster experience overall and will discourage feature adoption. I think the tradeoff of discarding these known events for a limitied period of time outweighs the additional communication latency of other, non-Resource Health-related health metrics failures that may occur during that same window when we're waiting for the all clear to re-enable all metrics errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 29, 2022
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Nov 30, 2022

/retest

1 similar comment
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Nov 30, 2022

/retest

@nojnhuh
Copy link
Contributor Author

nojnhuh commented Dec 6, 2022

/retest

)

// SDKAvailabilityStatusToCondition converts an Azure Resource Health availability status to a status condition.
func SDKAvailabilityStatusToCondition(avail resourcehealth.AvailabilityStatus) *clusterv1.Condition {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty subjective, but I think a name like status might be a bit more descriptive of the variable than avail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was trying to distinguish this with the status field on the resources, but something like availStatus might be a bit more clear while still keeping that distinction.

defer done()

resource := s.Scope.AvailabilityStatusResourceURI()
avail, err := s.GetByResource(ctx, resource)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above regarding the variable naming.

@jackfrancis
Copy link
Contributor

/retest

@nojnhuh
Copy link
Contributor Author

nojnhuh commented Dec 13, 2022

/retest

1 similar comment
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Dec 14, 2022

/retest

@nojnhuh
Copy link
Contributor Author

nojnhuh commented Dec 14, 2022

/assign @jackfrancis

@jackfrancis jackfrancis added this to the v1.7 milestone Dec 15, 2022
@jackfrancis
Copy link
Contributor

/assign

@jackfrancis
Copy link
Contributor

overall lgtm, left a couple comments

@jackfrancis
Copy link
Contributor

Due to the added time that this feature requires (and our quickly growing list of E2E test scenarios), I opened this in test-infra to bump the test timeout to 4h (I think I'm bumping it? Maybe the default is higher, in any event for tests that we expect to run for a while I think it's more appropriate to declare an explicit, long timeout.)

kubernetes/test-infra#28256

Copy link
Contributor

@willie-yao willie-yao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks for all the great work on this feature!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2022
@CecileRobertMichon
Copy link
Contributor

Test started last Tuesday at 11:57 AM passed after 1h43m12s

Due to the added time that this feature requires (and our quickly growing list of E2E test scenarios), I opened this in test-infra to bump the test timeout to 4h

That seems quite long for a required PR test. Right now it's optional and only runs for AKS exp changes but as we migrate the feature out of experimental and make this test required on every PR, we may want to think about how we can run a "lean" subset of the test in PR and keep the more lengthy (eg. upgrade) tests as periodic jobs + optional PR jobs that only trigger when the AKS codepaths get touched.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2022
Copy link
Contributor

@jackfrancis jackfrancis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 15, 2022
@k8s-ci-robot
Copy link
Contributor

@nojnhuh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-apiversion-upgrade 836c4fd link false /test pull-cluster-api-provider-azure-apiversion-upgrade

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants