Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tenant and client ID are not used from AzureClusterIdentity object for virtualmachineimages client #3569

Closed
Tracked by #3409
sonasingh46 opened this issue May 16, 2023 · 7 comments · Fixed by #3606 or #3843
Closed
Tracked by #3409
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.
Milestone

Comments

@sonasingh46
Copy link
Contributor

/kind bug

[Before submitting an issue, have you checked the Troubleshooting Guide?]

What steps did you take and what happened:
[A clear and concise description of what the bug is.]

The client at this code
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/azure/services/virtualmachineimages/client.go#L52
does not use the tenant ID and client ID that is passed in AzureClusterIdentity object.

What did you expect to happen:

When tenant ID and cluster ID is passed via AzureClusterIdentity, then the values present on AzureClusterIdentity should be used as a precedence over tenant ID and client ID env variables. Or
There should be a way to create client where we can explicitly pass tenant and client ID.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 16, 2023
@mboersma
Copy link
Contributor

/milestone v1.10

@k8s-ci-robot k8s-ci-robot added this to the v1.10 milestone May 16, 2023
@mboersma
Copy link
Contributor

This is the behavior of DefaultAzureCredential that is probably not working here: https://github.com/Azure/azure-sdk-for-go/blob/sdk/azidentity/v1.3.0/sdk/azidentity/default_azure_credential.go#L40

I'll set up an AzureClusterIdentity test locally and see what I can figure out.

@likamrat
Copy link

I am wondering if i am hitting this issue. tried multiple service principals and subscriptions.

Environment

  • k8s - 1.26.x and 1.27.0
  • CAPZ - 1.7.6, 1.8.4, 1.9.2
  • clusterctl - 1.4.2
    Message:               failed to reconcile cluster services: failed to get availability zones: failed to get zones for location westus2: failed to refresh resource sku cache: could not list resource skus: autorest/Client#Do: Preparing request failed: StatusCode=0 -- Original Error: ManagedIdentityCredential authentication failed
GET http://169.254.169.254/metadata/identity/oauth2/token
--------------------------------------------------------------------------------
RESPONSE 503 Service Unavailable
--------------------------------------------------------------------------------
adal: Failed to execute the refresh request. Error = 'Post "https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token?api-version=1.0": dial tcp: lookup login.microsoftonline.com on 10.43.0.10:53: server misbehaving'

--------------------------------------------------------------------------------
To troubleshoot, visit https://aka.ms/azsdk/go/identity/troubleshoot#managed-id
    Reason:                Failed
    Severity:              Error
    Status:                False
    Type:                  Ready
    Last Transition Time:  2023-05-18T20:16:47Z
    Message:               Waiting for control plane provider to indicate the control plane has been initialized
    Reason:                WaitingForControlPlaneProviderInitialized
    Severity:              Info
    Status:                False
    Type:                  ControlPlaneInitialized
    Last Transition Time:  2023-05-18T20:16:47Z
    Message:               Scaling up control plane to 3 replicas (actual 0)
    Reason:                ScalingUp
    Severity:              Warning
    Status:                False
    Type:                  ControlPlaneReady
    Last Transition Time:  2023-05-18T20:17:05Z
    Message:               failed to reconcile cluster services: failed to get availability zones: failed to get zones for location westus2: failed to refresh resource sku cache: could not list resource skus: autorest/Client#Do: Preparing request failed: StatusCode=0 -- Original Error: ManagedIdentityCredential authentication failed
GET http://169.254.169.254/metadata/identity/oauth2/token
--------------------------------------------------------------------------------
RESPONSE 503 Service Unavailable
--------------------------------------------------------------------------------
adal: Failed to execute the refresh request. Error = 'Post "https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token?api-version=1.0": dial tcp: lookup login.microsoftonline.com on 10.43.0.10:53: server misbehaving'

--------------------------------------------------------------------------------
To troubleshoot, visit https://aka.ms/azsdk/go/identity/troubleshoot#managed-id
    Reason:             Failed
    Severity:           Error
    Status:             False
    Type:               InfrastructureReady
  Observed Generation:  1
  Phase:                Provisioning
Events:
  Type    Reason        Age    From                Message
  ----    ------        ----   ----                -------
  Normal  Provisioning  3m53s  cluster-controller  Cluster arcbox-capi-data-288d is Provisioning

@CecileRobertMichon
Copy link
Contributor

Circling back on above, issue from @likamrat is unrelated (more on root cause in https://kubernetes.slack.com/archives/CEX9HENG7/p1684340022925319)

@sonasingh46, @mboersma and I discussed offline, plan is to revert #3474 if @mboersma can't get to a quick fix by end of week

@sonasingh46
Copy link
Contributor Author

[1] So spent some time on this to figure out what is going on here.
The SDK ( like Matt pointed out ) tries the following options for authenticating in order:
- Via ENV
- Via WorkloadIdentity
- Via ManagedIdentity
- Via CLI
So whenever tenant ID and client ID is set up via ENV as well as provided via AzureClusterIdentity -- it end up using the one provided via ENV.

[2] Couple of experiments that I did:

  • Provided IDs via AzureClusterIdentity ServicePrincipal type. The env variables for IDs were also set with the same one as that of provided in AzureClusterIdentity ( Successfully created workload cluster )

  • Provided IDs via AzureClusterIdentity ServicePrincipal type, but made a change in the CAPZ code to unset the env variables for the IDs just before the mgr start in main function. ( This still successfully created workload cluster via managed identity because of the AAD pod identity magic )

  • Provided IDs via AzureClusterIdentity ServicePrincipal type, but made a change in the CAPZ code to set the env variables for the IDs with INCORRECT values. ( This failed workload cluster creation )

[3] How I hit this while testing workload identity?

  • I was passing IDs via the AzureClusteridentity.
  • I was also injecting tenant ID env variable to capz ( via a webhook ). This tenant ID was a dummy/incorrect value.
  • Now the code ended by using tenant ID from AzureClusterIdentity for all other services other than virtualimagesclient and successfully created resources on Azure cloud.
  • For the virtualimagesclient client it ended up using wrong tenant ID causing failures only for virtualimagesclient calls.
  • So workload cluster creation failed with couple of resources being created in Azure cloud.

So long story short:

  • If IDs are passed via AzureClusterIdentity as well as env variables and if the IDs of env variables happen to be incorrect, it kind of leaves the workload cluster in a failed stated with couple of resource created on Azure cloud when using workload identity which should not happen. Either it should fail completely of pass completely.
    So for virtualimagesclient it fails using ENV(because ID is read from ENV and is incorrect), WorkloadIdentity(because ID is read from ENV and is incorrect), as well as ManagedIdentity( there is no AzureIdentity object and aad pod identity is not used) .

  • Now this does not happen with ServicePrincipal because -- it fails when tried using ENV but passes using ManagedIdentity via the aad pod identity(AzureIdentity object exists with correct values) magic for virtualimagesclient

@mboersma
Copy link
Contributor

mboersma commented Jun 5, 2023

/reopen

This was inadvertently closed by triggering the Fixes #1234 regex.

@k8s-ci-robot k8s-ci-robot reopened this Jun 5, 2023
@k8s-ci-robot
Copy link
Contributor

@mboersma: Reopened this issue.

In response to this:

/reopen

This was inadvertently closed by triggering the Fixes #1234 regex.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment