Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Azure AD Workload Identity #54

Open
stevehipwell opened this issue Nov 11, 2022 · 26 comments
Open

Support Azure AD Workload Identity #54

stevehipwell opened this issue Nov 11, 2022 · 26 comments
Labels
help wanted Extra attention is needed

Comments

@stevehipwell
Copy link

Is your proposal related to a problem?

The currently supported Azure AD Pod Identity is deprecated in favour of the new Azure AD Workload identity.

Describe the solution you'd like

I'd like Thanos to support Azure AD Workload Identity.

Describe alternatives you've considered

n/a

Additional context

The following 2 PRs are for adding this support to other projects and might help with the required changes.

@fpetkovski fpetkovski added the help wanted Extra attention is needed label Nov 14, 2022
@phillebaba
Copy link
Contributor

phillebaba commented Nov 14, 2022

Lets move this to objstore. I can get this done when I have time. Right now AKS workload identity is still in public preview, so it has not been high enough prio for me. This will most likely only require some docs changes and maybe some configuration changes, as we have already made the migration to the new SDK.

@stevehipwell
Copy link
Author

@phillebaba the public preview part is the direct AKS integration via a cluster option, the Azure AD Workload Identity technology looks to be GA and has been recommended over Azure AD Pod Identity (which is now deprecated) for a relatively long time.

Based on the implementations I've seen there is a code change required but it doesn't look to be too significant.

@phillebaba
Copy link
Contributor

phillebaba commented Nov 14, 2022

Yes sorry what I mean was the AKS integration.
https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster

So the deprecation of AAD Pod Identity has more been used a signal to the community that no new features will be accepted into the project. Especially because the same people working on that project are also working on the new solution. However security patches will still be provided for a longer time.

The major reason we have not seen greater adoption of workload identities has been for the fact that it has not supported managed identities until recently. The reasons for this seemed to be due to technical limitations at the time in Azure. From my perspective there has been no point to switch until manage identity support was resolved which has just happened.
Azure/azure-workload-identity#325

I will try to get to this before the next release of Thanos.

@stevehipwell
Copy link
Author

Thanks @phillebaba, that would be great. You're absolutely right that this hasn't worked until very recently (even more recently if you wanted to use Terraform). The reason I'm so keen to get it working now is that the current Azure AD Pod Identity solution causes significant friction due to various defects and this is the path for us to remove that whole component from our AKS clusters.

@RagibAjmal
Copy link

Hi @phillebaba

I am new to open source contribution and I would like to work on this issue. Can I go ahead and work on this ?

@phillebaba
Copy link
Contributor

I had a look at this just now, and honestly I am wondering if this is already working and all we need are some docs. I did the whole SDK upgrade during the summer for this reason. Has anyone actually tried to run Thanos with Azure Workload Identity in AKS? All required configuration is injected as environment variables to the Pod so the MSI configuration should just pick it up. I don't have time right now so if you want to you could maybe verify if it works @RagibAjmal and get back with the results.

@RagibAjmal
Copy link

Sure I will work on it and get back with the results

@RagibAjmal
Copy link

RagibAjmal commented Nov 20, 2022

Hi @phillebaba ,

Please correct/guide me if I am wrong

I tried to access my azure storage container from thanos store using storage_account_key and the pod works fine, but when I try to use user_assigned_id(clientID of Managed Identity) instead of storage_account_key, the pod goes to CrashLoopBackOff and I get the below log from the pod.

level=debug ts=2022-11-20T20:09:49.183966621Z caller=main.go:67 msg="maxprocs: Leaving GOMAXPROCS=[2]: CPU quota undefined"
level=info ts=2022-11-20T20:09:49.18435653Z caller=factory.go:52 msg="loading bucket configuration"
level=debug ts=2022-11-20T20:09:49.184471532Z caller=azure.go:139 msg="creating new Azure bucket connection" component=store
level=error ts=2022-11-20T20:09:49.195210169Z caller=main.go:135 err="===== INTERNAL ERROR =====\nManagedIdentityCredential authentication failed\nGET http://169.254.169.254/metadata/identity/oauth2/token\n--------------------------------------------------------------------------------\nRESPONSE 400 Bad Request\n--------------------------------------------------------------------------------\n{\n "error": "invalid_request",\n "error_description": "Identity not found"\n}\n--------------------------------------------------------------------------------\n\nAzure API return unexpected error: *azblob.InternalError\n\ngithub.com/thanos-io/objstore/providers/azure.NewBucketWithConfig\n\t/go/pkg/mod/github.com/thanos-io/[email protected]/providers/azure/azure.go:170\ngithub.com/thanos-io/objstore/providers/azure.NewBucket\n\t/go/pkg/mod/github.com/thanos-io/[email protected]/providers/azure/azure.go:150\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/go/pkg/mod/github.com/thanos-io/[email protected]/client/factory.go:70\nmain.runStore\n\t/app/cmd/thanos/store.go:254\nmain.registerStore.func1\n\t/app/cmd/thanos/store.go:198\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncreate AZURE client\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/go/pkg/mod/github.com/thanos-io/[email protected]/client/factory.go:87\nmain.runStore\n\t/app/cmd/thanos/store.go:254\nmain.registerStore.func1\n\t/app/cmd/thanos/store.go:198\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncreate bucket client\nmain.runStore\n\t/app/cmd/thanos/store.go:256\nmain.registerStore.func1\n\t/app/cmd/thanos/store.go:198\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\npreparing store command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"

I can find the error is from objstore/providers/azure/azure.go so I tried to findout what happend by running in local

the function getContainerClient in helpers.go returns value when I provide user_assigned_id but the function containerClient.GetProperties prints the below value in my local

{{{ } }} ===== INTERNAL ERROR =====
ManagedIdentityCredential: Get "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=**I_am_hiding_this_value**&resource=https%3A%2F%2Fstorage.azure.com": dial tcp 169.254.169.254:80: i/o timeout

Here is what i get when i use in storage_account_key my local

{{{ 2022-11-20 20:14:40 +0000 GMT 0x140000120d0 0x14000018237 0x14000012050 0x14000018235 0x14000018236 0x14000018238 2022-11-19 10:27:50 +0000 GMT 0x14000012070 0x14000012080 map[] 0x140000120a0 0x140000120b0} 0x14000078000}}

Also, I am just going through the tutorials of workload Identity and I find that a webhook is inserting the clientID into the pod as a env var named AZURE_CLIENT_ID using ServiceAccount. I tried using the AZURE_CLIENT_ID using sa and webhooks but still I am getting the CrashLoopBackOff. I tried finding if we are getting the value AZURE_CLIENT_ID in objstore but I can only find we are getting the value user_assigned_id through secrets.

here is the configuration using sa and webhooks in the pod

Environment:
AZURE_CLIENT_ID: hidden by myself
AZURE_TENANT_ID: hidden by myself
AZURE_FEDERATED_TOKEN_FILE: /var/run/secrets/azure/tokens/azure-identity-token
AZURE_AUTHORITY_HOST: https://login.microsoftonline.com/

@phillebaba
Copy link
Contributor

Sounds like the SDK is not picking up the AZURE_CLIENT_ID parameter by itself which is a shame. I need to look further into the SDK to figure out how it is expected to be configured.

@phillebaba
Copy link
Contributor

Ok so I have totally missed a detail in the SDK. Federated tokens are not supported out of the box in the latest Go SDK. It will be available in January hopefully. In the meantime I will implement the code snippet example to get support working.
Azure/azure-sdk-for-go#15615 (comment)

@RagibAjmal
Copy link

Hi @phillebaba

Can I work on this

@phillebaba
Copy link
Contributor

Yeah sure, I would suggest you branch from #35 as it contains a couple of SDK changes caused by the upgrade. It would be great if you could verify that token renewal actually works, as it is not stated. If I remember correctly the token renews every hour.

@weisdd
Copy link

weisdd commented Dec 5, 2022

@phillebaba federated token gets updated once an hour, an access token - every 24h, so the end solution requires at least 25h of testing.

@RagibAjmal
Copy link

@phillebaba @weisdd. Could you please help me with how can we test the application or how/where do you test the application.
My free subscription ended few days before , so I need to pay for kubernetes cluster. I cant find any playground for 25+hours. I am not able to test the application in minikube too as I am unable to configure workload identity on it. I dont know any other way. Please help me with if there are any other way

@weisdd
Copy link

weisdd commented Dec 9, 2022

@RagibAjmal My approach is simple:

  • When I work on something that's needed for my employer, I use company's subscription;
  • When I work on something that concerns only me, then I pay the cloud bills myself.

Even if you cannot afford to pay the cloud bills, there's a way for you to go forward. - While testing the token renewal process, you don't actually need to have a kubernetes cluster. You can write a comprehensive test suite with a test web server, whose responses would emulate token expiration, and you just have to make sure an updated federated token is sent over to your web server. In case of adal, it was enough to just make an explicit call to .Refresh method (you can see it in my PR for cert-manager), the new library is likely to have something similar, it must be fun to research.

@rouke-broersma
Copy link

@phillebaba I've setup WI on my cluster with thanos now that it's GA. It doesn't seem to work:

level=error ts=2023-05-10T07:28:01.723782864Z caller=main.go:135 err="ManagedIdentityCredential authentication failed\nGET http://169.254.169.254/metadata/identity/oauth2/token\n--------------------------------------------------------------------------------\nRESPONSE 400 Bad Request\n--------------------------------------------------------------------------------\n{\n  \"error\": \"invalid_request\",\n  \"error_description\": \"Identity not found\"\n}\n--------------------------------------------------------------------------------\nTo troubleshoot, visit https://aka.ms/azsdk/go/identity/troubleshoot#managed-id\ncreate AZURE client\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/go/pkg/mod/github.com/thanos-io/[email protected]/client/factory.go:87\nmain.runCompact\n\t/app/cmd/thanos/compact.go:205\nmain.registerCompact.func1\n\t/app/cmd/thanos/compact.go:94\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\npreparing compact command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"

I think the issue might be that neither DefaultAzureCredential nor FederatedCredential is created but rather the ManagedIdentityCredential is created directly, which is not a valid credential for workload identity. That credential signals the use of managed identity using the IMDS, instead of the federated token exchange endpoint used by workload identity.

@phillebaba
Copy link
Contributor

@rouke-broersma azidentity v1.3.0 was released two days ago, with official support for workload identity. As soon as I get some time over I will upgrade the package and test it. Unless someone else has time right now to do it.

@stevehipwell
Copy link
Author

@phillebaba should I also open a new issue in thanos-io/objstore/ to track the work?

@phillebaba
Copy link
Contributor

@fpetkovski @GiedriusS could you move this issue to objstore?

If that is not possible it would be better to create a new issue.

@fpetkovski fpetkovski transferred this issue from thanos-io/thanos May 19, 2023
@fpetkovski
Copy link
Contributor

Good idea, moved.

@stevehipwell
Copy link
Author

@phillebaba @fpetkovski don't we need an issue in both repos? This repo needs an issue to track the actual work while the main Thanos repo needs an issue to track when Azure AD Workload Identity support has been added into Thanos.

@cmergenthaler
Copy link

Any update on this?

@rikhil-s
Copy link
Contributor

rikhil-s commented Oct 2, 2023

@phillebaba Any further updates on this? AAD Pod Identity is now end of life so we now have to use Thanos with the Azure Workload Identity Sidecar (which also won't be supported long-term).

rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 19, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 19, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 19, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 19, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 19, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 19, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 19, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 19, 2023
@rikhil-s
Copy link
Contributor

PR has been created here #82 to resolve this issue. Just need someone to review!

rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 23, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Oct 27, 2023
rikhil-s added a commit to rikhil-s/objstore that referenced this issue Nov 10, 2023
@cmergenthaler
Copy link

Isn't this implemented with #82 now?

@MichaHoffmann
Copy link
Contributor

Yeah i think this issue can be closed! It can be tested with 0.33.0-rc.0 already

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

9 participants