[AKS] "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope #777
Comments
Workload identity (chart 0.15.0 on AKS 1.25.5) stopped working with the same errors, and reinstalling it fixed the problem. |
@fewkso @pockyhe How are you installing workload identity on AKS? Are you installing it using the helm charts from this repo or enabling the AKS add-on for workload identity? The error message seems to indicate the @karataliu Any rollout in AKS that could cause this issue? |
@aramase It was installed 16 days ago with Helm, right after the cluster was created, with only azureTenantID set as a value. It worked for a while, then this morning a deployment failed because the workload identity env vars weren't being injected anymore. I ended up looking in the WI webhook pod logs and found these errors. I have another cluster for the same client with 0.14.0 installed and there's no Something seems to have deleted the |
I have the same issue, the ClusterRole and ClusterRoleBinding were somehow deleted, but I don't know how they were deleted. I had to uninstall & reinstall the helm chart again. This is rather worrying if the ClusterRole & ClusterRoleBindings are somehow getting deleted without user interaction. |
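One way to confirm that the RBAC objects are missing, rather than the webhook itself being broken, is to ask the API server whether the webhook's service account can still list the resources named in the error. The Python sketch below only builds and prints the `kubectl auth can-i` commands; it does not run them. The service account and resource names are copied from the error message in this issue, the helper name `can_i_command` is hypothetical, and it assumes your own credentials are allowed to impersonate the service account via `--as`.

```python
import shlex

# Values copied from the "forbidden" error message in this issue.
SERVICE_ACCOUNT = "system:serviceaccount:workload-identity:azure-wi-webhook-admin"
RESOURCES = [
    "mutatingwebhookconfigurations.admissionregistration.k8s.io",
    "serviceaccounts",
]


def can_i_command(verb, resource, user):
    """Build a `kubectl auth can-i` invocation that impersonates `user`.

    Hypothetical helper: it returns the argument list and never calls kubectl.
    """
    return ["kubectl", "auth", "can-i", verb, resource, "--as", user]


# Print one command per resource so they can be pasted into a shell.
for resource in RESOURCES:
    print(shlex.join(can_i_command("list", resource, SERVICE_ACCOUNT)))
```

A "no" answer from either command would be consistent with the ClusterRole or ClusterRoleBinding having been deleted.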
The only possible entity that deleted the |
We had a similar issue, and the way it was resolved was to re-install the helm chart. |
This is a bug with the AKS add-on code that's causing deletion of For anyone who comes across this issue, if your |
Thanks very much for reporting the issue. AKS side is deploying the fix, ETA is ~1 day. This issue will impact open source version of workload-identity. Please reinstall workload-identity helm release or use AKS integrated workload identity as a fix. |
@aramase @karataliu Thanks to you guys for the quick investigation 😄 Also, I don't know how many AKS clusters use workload identity out there, but a lot of them will only exhibit the problem once pods are restarted and that could be in days or weeks. Would there be any way to email clients with clusters using workload identity about this issue? I know it's still in preview but it's already used in production in some cases. |
Thanks @karataliu / @aramase +1 to @fewkso's comment, we just spent hours debugging this and going back and forth with AKS Support until the issue was discovered. It's in preview, so we're happy to accept outages here and there, but the deletion of critical resources without any notice is hard to accept. |
@aramase @karataliu Thank you all for the investigation. |
Thank you @karataliu, I appreciate the effort and transparency in rectifying the issue. |
Having the same problem. One thing I noticed about the "workaround" is that killing the pods shown below is not enough; you need to do a helm uninstall and reinstall. helm uninstall workload-identity-webhook -n blah Wondering whether just using an older version of the helm chart (https://azure.github.io/azure-workload-identity/charts) is a better workaround and, if so, which version I should use. |
Looking for some clarification here. We started running into this issue too and also came to the conclusion that re-installing workload-identity resolves it. What I'd like clarified is what the "fix" is. Is this something that is being fixed on the Azure side, or is it a fix to workload identity itself? Just trying to understand whether, in clusters where we have not encountered this yet, we will need to re-install an updated version of workload identity or whether it will resolve itself. |
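The uninstall-and-reinstall workaround described here can be sketched as the sequence of Helm commands below. This is only an illustration that prints the commands. The chart repository URL and the release name come from this thread; the repo alias, the chart name `azure-workload-identity/workload-identity-webhook`, the namespace (taken from the service account in the error message), and the tenant ID placeholder are assumptions to adapt to your own install.

```python
import shlex

# From this thread: chart repository URL and release name.
REPO_URL = "https://azure.github.io/azure-workload-identity/charts"
RELEASE = "workload-identity-webhook"

# Assumptions: adjust these to match your own installation.
REPO_ALIAS = "azure-workload-identity"
CHART = f"{REPO_ALIAS}/workload-identity-webhook"
NAMESPACE = "workload-identity"
TENANT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder tenant ID


def reinstall_commands(release, namespace, tenant_id):
    """Return Helm commands for an uninstall followed by a fresh install.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    return [
        ["helm", "repo", "add", REPO_ALIAS, REPO_URL],
        ["helm", "repo", "update"],
        ["helm", "uninstall", release, "--namespace", namespace],
        ["helm", "install", release, CHART,
         "--namespace", namespace,
         "--set", f"azureTenantID={tenant_id}"],
    ]


for cmd in reinstall_commands(RELEASE, NAMESPACE, TENANT_ID):
    print(shlex.join(cmd))
```

Workloads that rely on the injected environment variables would likely need a restart after the reinstall, since the webhook only mutates pods at creation time.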
@davejhahn @SOFSPEEL Workload identity webhook can be installed with the az command line without installing the helm chart. From what I understand, a bug introduced in an AKS cleanup job was deleting the workload identity Microsoft already fixed it on their side, but they won't undelete what they deleted. Reinstalling the chart fixes the problem, but just recreating the |
@fewkso OK, thanks for the clarification. And yep, I'm aware that it can be installed with the Azure CLI. We have all of this integrated into a Terraform sub-module that creates everything needed (one part of which is deploying the helm chart). It sounds like it's fixed at this point, but we should re-install workload identity in all our clusters (and restart any deployments). |
I don't know if the fix has been applied to all regions yet, but just as an FYI, the ClusterRole & ClusterRoleBinding just got deleted from our AKS clusters again. |
We've hit this issue twice today, with about 5 hours in-between on two clusters. Why isn't Microsoft notifying customers? This is a huge issue. |
We've hit this issue three times today, about 5 hours apart, on one cluster. We are also getting the error in another cluster that uses Azure workload identity; I am guessing we are just lucky that no pods have restarted there yet. Why isn't Microsoft notifying us? This is a MAJOR issue. |
@Houley01 I agree this is a critical issue. If you think about it, every customer that uses workload identity outside of the managed version will run into this. It more or less breaks any application reliant on it (eventually). Customers in this scenario may not even be aware of the issue, and once they notice that something is wrong, they have no idea what it is and have to spend time researching it before (hopefully) ending up here to read this thread. But maybe they aren't finding this thread and are pulling their hair out, and in the meantime suffering production outages, when a simple notification of a potential problem (and resolution) could be sent out. I'd like to know why they are inside our clusters deleting things in the first place. |
Even though this feature is really nice and useful and people are eager to put it in place everywhere, it's still in preview, don't forget that. The problem happened again on my clusters, so I just decided to switch to the managed version. Problem fixed 😄 |
Hello, |
@miwithro I'm assuming this means that the issue is still going to be occurring until the status indicates it is complete for the region applicable to us? Once it indicates it is complete, is there anything (aside from addressing the workaround described above) that needs to be done on the AKS side on our end? |
For anyone saying this is a preview: Microsoft has deprecated the currently supported version (aad-pod-identity), and the page it links to, https://azure.github.io/azure-workload-identity/docs/, does not mention anything about this being in preview. The deprecation notice reads: "IMPORTANT: As of Monday 10/24/2022, AAD Pod Identity is deprecated. As mentioned in the announcement, AAD Pod Identity has been replaced with Azure Workload Identity. Going forward, we will no longer add new features to this project in favor of Azure Workload Identity. We will continue to provide critical bug fixes until Azure Workload Identity reaches general availability. Following that, we will provide CVE patches until September 2023, at which time the project will be archived." Anyone building new systems would choose Azure Workload Identity based on that information, so the notice should state that workload identity is not supported by Azure support and should not be used because it is currently in preview. That issue aside, Microsoft's AKS team deleted resources in customers' clusters. It doesn't matter whether they belong to a supported product or not: the team actively deleted resources and should have told AKS users that resources may have been deleted from their AKS environments, rather than ignoring the problem. It doesn't matter what was deleted; it matters that something was deleted without communication. This does not give me confidence in AKS if Microsoft may delete resources from our clusters without any communication. This issue alone had us burning through a day of our clients' time. |
Thanks everyone for the kind comments. Here are some updates.
Thank you again for taking the time to share your feedback with us. We appreciate your constructive comments, as they help us improve our products and services. We are always striving to deliver the best quality and value to our customers. |
Thanks @karataliu! Can I add it is hugely appreciated that this type of RCA information is provided here. It helps us to build trust with our clients that this is being resolved quickly by the experts on the problem! So thanks on behalf of our team and our clients! Thank you also to everyone who has contributed to this thread, we were able to get actionable information in a short time frame which for a GitHub issue is fantastic! How times have changed :D |
Closing this issue with #777 (comment) |
Describe the bug
We enabled workload identity in our AKS cluster to access Azure resources, and it worked well. Today we suddenly found that our service could no longer access Azure resources.
The only change to this AKS cluster is that we reinstalled cert-manager 6 days ago. We don't know what caused this problem.
To fix it, we reinstalled workload identity with Helm, and it works again.
Logs
E0306 07:49:05.332724 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
W0306 07:49:42.898523 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
E0306 07:49:42.898557 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
W0306 07:49:49.023471 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.ServiceAccount: serviceaccounts is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "serviceaccounts" in API group "" at the cluster scope
E0306 07:49:49.023505 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.ServiceAccount: failed to list *v1.ServiceAccount: serviceaccounts is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "serviceaccounts" in API group "" at the cluster scope
W0306 07:50:29.890010 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1.ServiceAccount: serviceaccounts is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "serviceaccounts" in API group "" at the cluster scope
E0306 07:50:29.890049 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.ServiceAccount: failed to list *v1.ServiceAccount: serviceaccounts is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "serviceaccounts" in API group "" at the cluster scope
W0306 07:50:33.080530 1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
E0306 07:50:33.080567 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: failed to list admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope
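Since the log repeats the same two denials many times, a small script can condense it into the distinct permissions that are missing. The Python sketch below is a rough illustration, not part of the webhook: the function name `summarize` is hypothetical, the sample lines are abridged from the logs above, and the regular expression assumes the standard Kubernetes "forbidden" message wording seen in this issue.

```python
import re
from collections import Counter

# Matches the tail of a Kubernetes RBAC denial as it appears in the logs above.
FORBIDDEN = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
)


def summarize(lines):
    """Count distinct (user, verb, resource, API group) denials in log lines."""
    counts = Counter()
    for line in lines:
        match = FORBIDDEN.search(line)
        if match:
            counts[(match["user"], match["verb"],
                    match["resource"], match["group"])] += 1
    return counts


# Sample lines abridged from the logs in this issue (prefixes shortened).
SAMPLE = [
    'E0306 07:49:05.332724 1 reflector.go:138] ... is forbidden: User '
    '"system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list '
    'resource "mutatingwebhookconfigurations" in API group '
    '"admissionregistration.k8s.io" at the cluster scope',
    'W0306 07:49:49.023471 1 reflector.go:324] ... serviceaccounts is forbidden: User '
    '"system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list '
    'resource "serviceaccounts" in API group "" at the cluster scope',
    'E0306 07:49:49.023505 1 reflector.go:138] ... serviceaccounts is forbidden: User '
    '"system:serviceaccount:workload-identity:azure-wi-webhook-admin" cannot list '
    'resource "serviceaccounts" in API group "" at the cluster scope',
]

for (user, verb, resource, group), n in summarize(SAMPLE).items():
    # An empty API group string denotes the core group.
    print(f"{n}x {user} cannot {verb} {resource} (API group: {group or 'core'})")
```

Every distinct entry in the summary corresponds to a rule that the webhook's ClusterRole would normally grant, which is why all of them fail at once when the role or its binding disappears.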
Environment