[AKS] "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope #777

Closed
pockyhe opened this issue Mar 6, 2023 · 29 comments
Labels
aks bug Something isn't working

Comments

@pockyhe

pockyhe commented Mar 6, 2023

Describe the bug
We enabled workload identity in our AKS cluster to access Azure resources, and it worked well. Today we suddenly found that our service could no longer access Azure resources.
The only change in this AKS cluster is that we reinstalled cert-manager 6 days ago. We don't know what caused this problem.

To fix it, we reinstalled it with Helm, and it works well again.

Logs
E0306 07:49:05.332724 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
W0306 07:49:42.898523 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
E0306 07:49:42.898557 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
W0306 07:49:49.023471 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.ServiceAccount: serviceaccounts is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "serviceaccounts" in API group "" at the cluster scope
E0306 07:49:49.023505 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.ServiceAccount: failed to list *v1.ServiceAccount: serviceaccounts is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "serviceaccounts" in API group "" at the cluster scope
W0306 07:50:29.890010 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.ServiceAccount: serviceaccounts is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "serviceaccounts" in API group "" at the cluster scope
E0306 07:50:29.890049 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.ServiceAccount: failed to list *v1.ServiceAccount: serviceaccounts is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "serviceaccounts" in API group "" at the cluster scope
W0306 07:50:33.080530 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
E0306 07:50:33.080567 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
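
The repeated denials in the log above can be condensed into a short summary of who is denied what. This is a minimal Python sketch, assuming the log lines keep the `User "..." cannot <verb> resource "..." in API group "..."` wording shown above; the function name `summarize_denials` is hypothetical and not part of any project tooling.

```python
import re
from collections import Counter

# Matches the RBAC denial wording seen in the webhook logs above.
DENIAL = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
)


def summarize_denials(log_text: str) -> Counter:
    """Count (user, verb, resource, api_group) tuples across all log lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DENIAL.search(line)
        if match:
            counts[match.group("user", "verb", "resource", "group")] += 1
    return counts
```

Applied to the excerpt above, the summary would contain only two distinct denials for the same service account: `list` on `mutatingwebhookconfigurations` and `list` on `serviceaccounts`, which is consistent with a missing ClusterRole or ClusterRoleBinding rather than a problem in the webhook itself.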

Environment

@pockyhe pockyhe added the bug Something isn't working label Mar 6, 2023
@fewkso

fewkso commented Mar 6, 2023

Workload identity (chart 0.15.0 on aks 1.25.5) stopped working with the same errors and reinstalling it fixed the problem.

@aramase
Member

aramase commented Mar 6, 2023

@fewkso @pockyhe How are you installing workload identity on AKS? Are you installing it using the helm charts from this repo or enabling the AKS add-on for workload identity? The error message seems to indicate the ClusterRole might have been deleted which caused the issue.

@karataliu Any rollout in AKS that could cause this issue?

@aramase aramase added the aks label Mar 6, 2023
@fewkso

fewkso commented Mar 6, 2023

@aramase It was installed 16 days ago, right after the cluster install, with helm and with only the azureTenantID as a value. It worked for a while, then this morning a deployment was failing because the workload identity env vars weren't injected anymore. I ended up looking in the WI webhook pod logs and there were these errors.

I have another cluster for the same client with 0.14.0 installed and there's no azure-wi-webhook-manager-role ClusterRole present and the webhook pods started logging the same errors.

Something seems to have deleted the ClusterRole and the ClusterRoleBinding, I did a helm template and recreated both resources and it fixed the webhook on the other cluster.
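
The "helm template and recreate both resources" approach can be sketched as a small filter over the rendered chart output. This is a minimal Python sketch, assuming the rendered manifests are plain multi-document YAML separated by `---` lines with a top-level `kind:` field; the function name `keep_rbac_documents` is hypothetical, and the chart's actual template layout is not shown in this thread.

```python
import re

# Only the two cluster-scoped RBAC kinds reported as deleted in this issue.
WANTED_KINDS = {"ClusterRole", "ClusterRoleBinding"}


def keep_rbac_documents(rendered: str) -> str:
    """Return only the ClusterRole and ClusterRoleBinding documents."""
    kept = []
    # Split on YAML document separators that sit on their own line.
    for doc in re.split(r"(?m)^---[ \t]*$", rendered):
        # Match only a top-level `kind:`; nested ones (roleRef, subjects) are indented.
        kind = re.search(r"(?m)^kind:[ \t]*(\S+)[ \t]*$", doc)
        if kind and kind.group(1) in WANTED_KINDS:
            kept.append(doc.strip("\n"))
    return "\n---\n".join(kept) + ("\n" if kept else "")
```

The filtered text could then be reviewed and applied with `kubectl apply -f -`, which restores the RBAC objects without touching the rest of the release.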

@adminnz

adminnz commented Mar 6, 2023

I have the same issue, the ClusterRole and ClusterRoleBinding were somehow deleted, but I don't know how they were deleted.

I had to uninstall & reinstall the helm chart again.

This is rather worrying if the ClusterRole & ClusterRoleBindings are somehow getting deleted without user interaction.

@aramase
Member

aramase commented Mar 6, 2023

The only possible entity that deleted the ClusterRole and ClusterRoleBinding is AKS here. I've tagged @karataliu to look into this from the AKS side.

@boney9

boney9 commented Mar 6, 2023

We had a similar issue, and the way it was resolved was to re-install the helm chart.

@aramase
Member

aramase commented Mar 7, 2023

This is a bug with the AKS add-on code that's causing deletion of ClusterRole and ClusterRoleBinding even when WI is installed using helm charts. They are working on the fix and this is the bug tracking it: Azure/AKS#3510. Please follow the issue for more details. @karataliu will update that issue with the ETA for the fix.

For anyone who comes across this issue, if your ClusterRole or ClusterRoleBinding is missing, the recommendation is to reinstall workload identity using the helm charts.
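
One way to check whether the permissions are missing is to ask the API server directly with `kubectl auth can-i` while impersonating the webhook's service account. This is a minimal Python sketch that only builds the commands; it assumes the service account name and namespace from the error message in this issue, and that the caller is allowed to impersonate service accounts. The helper names are hypothetical.

```python
import shlex
from typing import List

# Service account taken from the error message in this issue.
NAMESPACE = "workload-identity"
SERVICE_ACCOUNT = "azure-wi-webhook-admin"

# The two list permissions the webhook logs show as denied.
RESOURCES = [
    "mutatingwebhookconfigurations.admissionregistration.k8s.io",
    "serviceaccounts",
]


def can_i_command(verb: str, resource: str, namespace: str, service_account: str) -> List[str]:
    """Build a `kubectl auth can-i` call that impersonates a service account."""
    return [
        "kubectl", "auth", "can-i", verb, resource,
        "--all-namespaces",
        f"--as=system:serviceaccount:{namespace}:{service_account}",
    ]


def render_checks() -> List[str]:
    """Return one shell-ready command line per resource to check."""
    return [
        shlex.join(can_i_command("list", resource, NAMESPACE, SERVICE_ACCOUNT))
        for resource in RESOURCES
    ]


if __name__ == "__main__":
    for command in render_checks():
        print(command)
```

Each command answers `yes` or `no`. A `no` for either resource suggests the ClusterRole or ClusterRoleBinding is gone and the chart needs to be reinstalled.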

@aramase aramase changed the title ser "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope [AKS] "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope Mar 7, 2023
@karataliu
Contributor

Thanks very much for reporting the issue.

The AKS side is deploying the fix; the ETA is ~1 day.

This issue impacts the open source version of workload-identity.

Please reinstall the workload-identity helm release or use the AKS integrated workload identity as a fix.

@fewkso

fewkso commented Mar 7, 2023

@aramase @karataliu Thanks to you guys for the quick investigation 😄

Also, I don't know how many AKS clusters use workload identity out there, but a lot of them will only exhibit the problem once pods are restarted and that could be in days or weeks. Would there be any way to email clients with clusters using workload identity about this issue? I know it's still in preview but it's already used in production in some cases.

@liamgib

liamgib commented Mar 7, 2023

Thanks @karataliu / @aramase

+1 to @fewkso comment, we just spent hours debugging this and going back and forth w/ AKS Support until the issue was discovered. It's in preview, so happy to accept outages here and there but the deletion of critical resources without any notice is hard to accept.

@pockyhe
Author

pockyhe commented Mar 7, 2023

@aramase @karataliu Thank you all for the investigation.

@karataliu
Contributor

@fewkso @liamgib Sorry about the inconvenience, and many thanks for the kind reminder.
We have been actively working on the notification, and it is on the way. We'll keep improving on delivering the information in time.

@liamgib

liamgib commented Mar 8, 2023

Thank you @karataliu, I appreciate the effort and transparency in rectifying the issue.

@SOFSPEEL

SOFSPEEL commented Mar 8, 2023

Having same problem, one thing I noticed about "workaround" is that killing the pods shown below is not enough, need to do helm uninstall and reinstall.

helm uninstall workload-identity-webhook -n blah

Wondering if just using an older version of the helm install (https://azure.github.io/azure-workload-identity/charts) is a better workaround, if so which version should I use.

image

@davejhahn

Looking for some clarification here.... started running into this issue also and also came to the conclusion that re-installing workload-identity resolves the issue.

The clarification I need is on what the "fix" is. Is this something that is being fixed on the Azure side, or a fix to workload identity to handle it?

Just trying to understand whether, on clusters where we have not encountered this yet, we will need to re-install an updated version of workload identity or whether it will resolve itself.

@fewkso

fewkso commented Mar 8, 2023

@davejhahn @SOFSPEEL Workload identity webhook can be installed with the az command line without installing the helm chart.

From what I understand, a bug introduced in an AKS cleanup job was deleting the workload identity ClusterRole and ClusterRoleBinding if the "managed" Workload Identity webhook wasn't used.

Microsoft already fixed it on their side, but they won't undelete what they deleted.

Reinstalling the chart fixes the problem, but just recreating the ClusterRole and ClusterRoleBinding also works.

@davejhahn

@fewkso ok thanks for the clarification. And yep, aware that it can be installed with azure cli. We have this all integrated into a terraform sub-module to create everything needed (and one part of that is deploying the helm chart). Sounds like at this point, it's fixed, but we should re-install workload identity in all our clusters (and restarting any deployments).
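
To decide which deployments need a restart after the reinstall, one option is to look for pods that were created while the webhook was broken and therefore lack the injected variables. This is a minimal Python sketch operating on a pod object as returned by `kubectl get pod -o json`. The variable names in `EXPECTED_ENV` are an assumption based on my understanding of what the webhook injects and are not stated in this thread, so adjust them to your version; the function name is hypothetical.

```python
from typing import Dict, List

# ASSUMPTION: environment variables the workload identity webhook injects.
# Verify against the documentation for the chart version you run.
EXPECTED_ENV = (
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_FEDERATED_TOKEN_FILE",
    "AZURE_AUTHORITY_HOST",
)


def missing_injected_env(pod: dict) -> Dict[str, List[str]]:
    """Map container name to the expected variables it does not define."""
    missing: Dict[str, List[str]] = {}
    for container in pod.get("spec", {}).get("containers", []):
        present = {env.get("name") for env in container.get("env") or []}
        absent = [name for name in EXPECTED_ENV if name not in present]
        if absent:
            missing[container["name"]] = absent
    return missing
```

An empty result means every container in the pod carries the expected variables. Any non-empty result marks a pod whose owning deployment should be restarted once the ClusterRole and ClusterRoleBinding are back.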

@liamgib

liamgib commented Mar 8, 2023

I don't know if the fix has been applied to all regions yet, but just as an FYI, the ClusterRole & ClusterRoleBindings just got deleted from our AKS clusters again.

@pockyhe
Author

pockyhe commented Mar 9, 2023

@liamgib @aramase I faced the same situation. After I reinstalled it, this bug happened again today.

@mlabrum

mlabrum commented Mar 9, 2023

We've hit this issue twice today, with about 5 hours in-between on two clusters.

Why isn't Microsoft notifying customers? This is a huge issue.

@Houley01

Houley01 commented Mar 9, 2023

We've hit this issue three times today, with about 5 hours in-between on one cluster.

And we are getting the error in another cluster using Azure workload identity; I am guessing we are just lucky that no pods have restarted there yet.

Why isn't Microsoft notifying us? This is a MAJOR issue.

@davejhahn

@Houley01 I agree this is a critical issue.

If you think about it, every customer that uses workload identity outside of the managed version will run into this. It more or less breaks any application reliant on it (eventually).

Every customer in this scenario may not even be aware of the issue, and once they notice that something is wrong, they have no idea what it is and have to spend time researching it, (hopefully) ending up here to read this thread.

But maybe they aren't finding this thread and are pulling their hair out, and in the meantime suffering production outages, when a simple notification of a potential problem (and resolution) could be sent out.

I'd like to know why they are inside our clusters deleting things in the first place.

@fewkso

fewkso commented Mar 9, 2023

Even though this feature is really nice and useful and people are eager to put it in place everywhere, it's still in preview, don't forget that.

The problem happened again on my clusters, so I just decided to switch to the managed version. Problem fixed 😄

@miwithro

miwithro commented Mar 9, 2023

Hello,
We have rolled out a fix for this issue, which is included in the "v20230226" release. You can track the release for the particular region your cluster is in here: https://releases.aks.azure.com/#tabversion

@davejhahn

@miwithro I'm assuming this means that the issue is still going to be occurring until the status indicates it is complete for the region applicable to us?

Once it indicates it is complete, is there anything (aside from addressing the workaround described above) that needs to be done on the AKS side on our end?

@mlabrum

mlabrum commented Mar 9, 2023

For anyone saying this is a preview: Microsoft has deprecated the currently supported version (aad-pod-identity), and the link it directs you to, https://azure.github.io/azure-workload-identity/docs/, does not mention anything about this being in preview.
The only such message appears when you check the git repository under support: https://github.com/Azure/azure-workload-identity

IMPORTANT: As of Monday 10/24/2022, AAD Pod Identity is deprecated. As mentioned in the announcement, AAD Pod Identity has been replaced with Azure Workload Identity. Going forward, we will no longer add new features to this project in favor of Azure Workload Identity. We will continue to provide critical bug fixes until Azure Workload Identity reaches general availability. Following that, we will provide CVE patches until September 2023, at which time the project will be archived.

So anyone building new systems would choose to implement Azure Workload Identity based on that information. This notice should mention that workload identity is not supported by Azure support and should not be used, as it's currently in preview.

That issue aside, Microsoft's AKS team deleted resources in clusters. It doesn't matter whether they belong to a supported product or not: they actively deleted resources and should have told AKS users that they may have deleted resources from their AKS environments, not just ignored the problem. It doesn't matter what was deleted; it matters that something was deleted without communication.

This does not give me confidence in the AKS environment if Microsoft may delete resources from our clusters without any sort of communication. This issue alone had us burning through a day of clients' time.

@karataliu
Contributor

karataliu commented Mar 10, 2023

Thanks everyone for the kind comments. Here are some updates.

  • Impact
    In a recent release, the AKS service unexpectedly triggered deletion of specific resources belonging to the workload identity open source project, including its ClusterRole and ClusterRoleBinding.
    Besides the release rollout, any update operation on the AKS cluster also triggers the deletion. This explains why the resources were deleted more than once.

  • RCA:
    AKS was adding a cleanup of the 'workload identity cluster role' for clusters where the AKS integrated workload identity is in the disabled state. The original intention was to clean up only the cluster role resource managed by the AKS integrated workload identity, but the cleanup was missing the proper selector label. As a result, the open source project's cluster role resource was also deleted.
    The fix adds the proper label to the deletion operation, so that it only affects resources managed by the AKS integrated workload identity.

  • Fix timeline:

    • The fix completed in all regions at UTC 2023-03-09 07:35:00; it takes up to 24h to apply to existing clusters.
    • New clusters created after the above timestamp are not affected by the issue.
    • Existing clusters are no longer affected after up to 24h, i.e. by UTC 2023-03-10 07:35:00.
  • Action needed:
    Please reinstall the open source helm chart or use the AKS integrated feature as a fix.

  • Notification:
    We started preparing the notification as soon as the issue was first detected, but its delivery was delayed due to an issue.
    As of the latest update, a notification mail titled "Reinstall open source component or fix with AKS integration" was delivered to all affected subscription owners around UTC 2023-03-10 03:00:00. We'll keep improving on delivering information in time.

Thank you again for taking the time to share your feedback with us. We appreciate your constructive comments, as they help us improve our products and services. We are always striving to deliver the best quality and value to our customers.
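
The missing-selector problem described in the RCA can be illustrated with a toy model. This is a minimal Python sketch, not the AKS cleanup code: the label key `managed-by`, its values, the second role name, and the function `select_for_deletion` are all hypothetical. It only shows how a cleanup that matches by name alone also picks up a resource installed by someone else, while adding a label selector narrows it to the managed one.

```python
from typing import Dict, List, Optional


def select_for_deletion(
    cluster_roles: List[dict],
    name_prefix: str,
    selector: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the names of the roles a cleanup pass would delete."""
    selected = []
    for role in cluster_roles:
        if not role["name"].startswith(name_prefix):
            continue
        labels = role.get("labels", {})
        # With a selector, every key/value pair must match the role's labels.
        if selector is not None and any(labels.get(k) != v for k, v in selector.items()):
            continue
        selected.append(role["name"])
    return selected


# Hypothetical inventory: one role from the helm chart, one from the add-on.
ROLES = [
    {"name": "azure-wi-webhook-manager-role", "labels": {"managed-by": "helm"}},
    {"name": "azure-wi-webhook-manager-role-addon", "labels": {"managed-by": "aks-addon"}},
]
```

Calling `select_for_deletion(ROLES, "azure-wi-webhook")` returns both roles, which mirrors the reported behavior. Passing `selector={"managed-by": "aks-addon"}` returns only the add-on's role, which mirrors the described fix.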

@lynkz-matt-psaltis

Thanks @karataliu! Can I add it is hugely appreciated that this type of RCA information is provided here. It helps us to build trust with our clients that this is being resolved quickly by the experts on the problem! So thanks on behalf of our team and our clients! Thank you also to everyone who has contributed to this thread, we were able to get actionable information in a short time frame which for a GitHub issue is fantastic! How times have changed :D

@aramase
Member

aramase commented Mar 13, 2023

Closing this issue with #777 (comment)
