[bug] IRSA example login errors #1083

Closed
Justin-DynamicD opened this issue Nov 3, 2020 · 20 comments · Fixed by #1090

Comments

@Justin-DynamicD

The example deployment outlined here:
https://github.com/terraform-aws-modules/terraform-aws-eks/tree/master/examples/irsa

I could not get this example to function in my testing, running into IAM permissions issues.

While the role and permissions appear to get created properly, after modifying and applying the helm chart, the resulting auto-scaler service gets stuck in a crash loop stating access denied and declaring it did not have the necessary permissions to perform the action, failing immediately and citing a lack of sts:AssumeRoleWithWebIdentity permissions (I verified this is set in the trust relationship of the default "cluster-autoscaler" role).

The only variations I made were updating the cluster to 1.18 and setting the image version accordingly in the helm chart.

After struggling with this, I fell back on AWS documentation and applied permissions directly to the EC2 roles generated by the module and the service started perfectly.

I feel like there is some step missing here that I'm unaware of in order to use the endpoint?

My actual module call is below, tags removed:

module "eks" {
  source           = "terraform-aws-modules/eks/aws"
  cluster_name     = local.cluster_name
  cluster_version  = "1.18"
  subnets          = data.terraform_remote_state.network.outputs.private_subnets
  write_kubeconfig = false
  vpc_id           = data.terraform_remote_state.network.outputs.vpc_id
  enable_irsa      = true
  worker_groups = [
    {
      name                 = "default_workers"
      instance_type        = "t2.medium"
      asg_max_size         = 1
      platform             = "linux"
      kubelet_extra_args   = "--node-labels=spot=false"
      suspended_processes  = ["AZRebalance"]
      tags = [
        {
          "key"                 = "k8s.io/cluster-autoscaler/enabled"
          "propagate_at_launch" = "false"
          "value"               = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/${local.cluster_name}"
          "propagate_at_launch" = "false"
          "value"               = "true"
        }
      ]
    },
    {
      name                 = "win_workers"
      instance_type        = "t2.small"
      asg_desired_capacity = 1
      asg_max_size         = 2
      platform             = "windows"
    }
  ]

  worker_groups_launch_template = [
    {
      name                    = "nix_spot"
      override_instance_types = ["t3.medium", "t3.large"]
      spot_instance_pools     = 2
      asg_max_size            = 5
      platform                = "linux"
      kubelet_extra_args      = "--node-labels=kubernetes.io/lifecycle=spot"
      public_ip               = false
      autoscaling_enabled     = true
      protect_from_scale_in   = true
    }
  ]
  map_accounts = []
  map_roles    = []
  map_users    = var.eks_users
}
@barryib
Member

barryib commented Nov 3, 2020

@Justin-DynamicD which version of this module are you using?

Are you using terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc to define the IAM role with OIDC like in the example? If yes, can you please test it with version 3.x of that module?

module "iam_assumable_role_admin" {
  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version                       = "~> v3.0"
  create_role                   = true
  role_name                     = "cluster-autoscaler"
  provider_url                  = replace(module.eks.cluster_oidc_issuer_url, "https://", "")
  role_policy_arns              = [aws_iam_policy.cluster_autoscaler.arn]
  oidc_fully_qualified_subjects = ["system:serviceaccount:${local.k8s_service_account_namespace}:${local.k8s_service_account_name}"]
}
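
(For context, the aws_iam_policy.cluster_autoscaler referenced in role_policy_arns above is defined in the example itself. A rough, illustrative sketch of such a policy is below; the example's exact statements and conditions may differ.)

resource "aws_iam_policy" "cluster_autoscaler" {
  name_prefix = "cluster-autoscaler"
  description = "EKS cluster-autoscaler policy for cluster ${module.eks.cluster_id}"
  policy      = data.aws_iam_policy_document.cluster_autoscaler.json
}

data "aws_iam_policy_document" "cluster_autoscaler" {
  statement {
    sid    = "clusterAutoscalerAll"
    effect = "Allow"
    actions = [
      "autoscaling:DescribeAutoScalingGroups",
      "autoscaling:DescribeAutoScalingInstances",
      "autoscaling:DescribeLaunchConfigurations",
      "autoscaling:DescribeTags",
      "ec2:DescribeLaunchTemplateVersions",
    ]
    resources = ["*"]
  }

  # A working cluster-autoscaler policy also needs autoscaling:SetDesiredCapacity
  # and autoscaling:TerminateInstanceInAutoScalingGroup (the example scopes these
  # to ASGs tagged for the cluster); omitted here for brevity.
}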

@Justin-DynamicD
Author

Justin-DynamicD commented Nov 3, 2020

I am, but I'm using it as-is from the example page linked, meaning it's at 2.14.0. I'll retry off-hours and let you know how it behaves. Thanks for the tip!

Edit: I'm re-initializing with the following provider versions:

  • Using previously-installed hashicorp/local v1.4.0
  • Using previously-installed hashicorp/null v2.1.2
  • Using previously-installed hashicorp/template v2.2.0
  • Using previously-installed hashicorp/kubernetes v1.13.2
  • Using previously-installed hashicorp/random v2.3.0
  • Using previously-installed hashicorp/aws v3.11.0

TF version is: 0.13.4

@Justin-DynamicD
Author

No dice, sadly. I did not completely destroy, just deleted the old deployment and then tried the helm chart again:

E1104 02:50:42.813416       1 aws_manager.go:265] Failed to regenerate ASG cache: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
        status code: 403, request id: 919c3b67-0021-4543-8545-020c36371246
F1104 02:50:42.813448       1 aws_cloud_provider.go:382] Failed to create AWS Manager: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
        status code: 403, request id: 919c3b67-0021-4543-8545-020c36371246

I'm at a loss. I guess I'll just fall back on applying the permissions to the worker role for now, as I honestly have no idea where to begin troubleshooting this feature.

@barryib
Member

barryib commented Nov 4, 2020

My point here #1083 (comment) is to update the version constraint on the IAM role for OIDC module and run terraform apply again to see if it creates your assumable role correctly.

@Justin-DynamicD
Author

Sorry if I was not clear. Here's what I did to the environment:

  1. updated the version constraint like you asked and ran a new terraform apply.
  2. removed the old auto-scaler deployment (kubectl delete [...]).
  3. updated settings.yaml and verified the ARN matched what was displayed in IAM for said role.
  4. deployed the autoscaler with helm as per the README.
  5. pulled the resulting logs, noticed I was still getting the 403, and reverted back to using the "old" method outlined here, which doesn't use helm to deploy but also relies on the local EC2 role for its permissions, which works perfectly.

My problem is I honestly don't understand how IRSA works at all, so I don't even know where to look to verify the deployment is trying to hit the right endpoint, using the right creds, etc. I was hoping this template would get me a fast running example so I could see a "functioning" cluster, but I just seem to have hit a brick wall.

@Justin-DynamicD
Author

Justin-DynamicD commented Nov 4, 2020

I'm wondering if in the helm plan I need to make sure the k8s_service_account_name matches what's expected? Finally reading up on how IRSA is supposed to work, I think maybe the helm chart and the terraform plan are not creating the same service account names.

EDIT: Update

OK, if I run kubectl describe serviceaccount -n=kube-system cluster-autoscaler-aws-cluster-autoscaler I come up empty ... so it DOES seem the iam_assumable_role_admin module is not working for me (I'm assuming I should find that role name).

@barryib
Member

barryib commented Nov 4, 2020

It sounds like your helm chart doesn't create the correct annotation for your service account. Are you defining helm values like this https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/irsa/cluster-autoscaler-chart-values.yaml#L5-L7 ?

rbac:
  create: true
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<ACCOUNT ID>:role/cluster-autoscaler"

We updated the doc yesterday #1063

@brannondorsey
Contributor

I'm experiencing this exact same issue. I've confirmed the annotation is applied to the service account as mentioned above, and that it points to the ARN identified in this role.

@Justin-DynamicD
Author

Well, I'm glad I'm not crazy/the only one not seeing this work.

I did modify the helm chart YAML. As I am running EKS 1.18, I updated the file to the following:

awsRegion: us-west-2

rbac:
  create: true
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<redacted>:role/cluster-autoscaler"

autoDiscovery:
  clusterName: <my cluster name>
  enabled: true

image:
  repository: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler
  tag: v1.18.3

@Justin-DynamicD
Author

Should this line be updated?

servicename

Not certain why it's set as "cluster-autoscaler-aws-cluster-autoscaler" whereas the AWS role is simply "cluster-autoscaler". Again, I apologize if I'm blindly poking at the wrong things; I'm really just trying to look for gaps.

@brannondorsey
Contributor

brannondorsey commented Nov 5, 2020

This comment tipped me off to the solution. It appears that the latest cluster-autoscaler helm chart creates a service account named cluster-autoscaler-aws-cluster-autoscaler-chart instead of cluster-autoscaler-aws-cluster-autoscaler, which the IRSA example expects.

@Justin-DynamicD are you using the cluster-autoscaler chart from the new helm repo or the deprecated stable/charts repo one linked from the IRSA example README? If it's the former (like me), then updating local.k8s_service_account_name to cluster-autoscaler-aws-cluster-autoscaler-chart should solve the issue.

Perhaps we should update the IRSA example README to link to the new helm chart and update local.k8s_service_account_name to match.
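
For anyone following along, a minimal sketch of that locals.tf change, assuming the example's existing local names (the rest of the file stays as-is):

locals {
  k8s_service_account_namespace = "kube-system"
  # The chart from the new helm repo creates a service account named
  # "cluster-autoscaler-aws-cluster-autoscaler-chart" (see above); the IAM
  # role's OIDC subject is built from this value, so it must match exactly.
  k8s_service_account_name = "cluster-autoscaler-aws-cluster-autoscaler-chart"
}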

@brannondorsey
Contributor

Somewhat ironically, updating the iam_assumable_role_admin to version ~> v3.0 instead of version 2.14.0 as @barryib suggested in #1083 (comment) actually introduced a problem, as terraform removes the required role <> policy attachment for some reason.

Screen Shot 2020-11-05 at 11 04 00 AM

By using 2.14.0 instead I got this working with the solution in my previous comment.

@Justin-DynamicD
Author

Can confirm: v3 does remove the role policy attachment as well:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # module.iam_assumable_role_admin.aws_iam_role_policy_attachment.custom[0] will be created
  + resource "aws_iam_role_policy_attachment" "custom" {
      + id         = (known after apply)
      + policy_arn = "arn:aws:iam::<redacted>:policy/cluster-autoscaler20201103041508874900000018"
      + role       = "cluster-autoscaler"
    }

Plan: 1 to add, 0 to change, 0 to destroy.

I think you unlocked Pandora's box for me. Seems like I should have either used the deprecated autoscaler or dug into helm deeper.

@Justin-DynamicD
Author

Justin-DynamicD commented Nov 5, 2020

Confirming the issue was that the newer auto-scaler expects a different service account name than configured.

In the end, I had to update the service account name to match the new name, and now things seem to be running.

Thank you all for your help in this.

(EDIT: also note the new auto-scaler does not have any OS filters on the node selector, so if you are running a mixed-worker environment, you need to add a nodeSelector or similar.)

@barryib
Member

barryib commented Nov 5, 2020

To avoid this, shouldn't we set a service account name for the cluster autoscaler https://github.com/helm/charts/blob/master/stable/cluster-autoscaler/values.yaml#L121 ?
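
For example, something along these lines in the chart values would pin it (just a sketch; the name shown assumes the new chart's generated name discussed above, and the important part is that it matches local.k8s_service_account_name on the Terraform side):

rbac:
  create: true
  serviceAccount:
    # Explicit name so a chart upgrade can't silently change it; this must match
    # local.k8s_service_account_name, which the IAM role's OIDC subject is built from.
    name: cluster-autoscaler-aws-cluster-autoscaler-chart
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<ACCOUNT ID>:role/cluster-autoscaler"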

@barryib
Member

barryib commented Nov 5, 2020

Somewhat ironically, updating the iam_assumable_role_admin to version ~> v3.0 instead of version 2.14.0 as @barryib suggested in #1083 (comment) actually introduced a problem, as terraform removes the required role <> policy attachment for some reason.

Screen Shot 2020-11-05 at 11 04 00 AM

By using 2.14.0 instead I got this working with the solution in my previous comment.

Thanks for sharing this. I'll take a look as soon as I have a moment.

@brannondorsey
Contributor

To avoid this, shouldn't we set a service account name for the cluster autoscaler https://github.com/helm/charts/blob/master/stable/cluster-autoscaler/values.yaml#L121 ?

Yep, that works too. Although I think it actually makes more sense to update local.k8s_service_account_name in locals.tf instead given that the README now recommends using the new helm chart (hosted at https://kubernetes.github.io/autoscaler) instead of the deprecated one.

I've opened a PR #1090 with these changes as well as a few other fixes in the IRSA README.

@barryib
Member

barryib commented Nov 6, 2020

Great. Thanks for opening a PR for this. And yes, we should update the example to use the recommended helm chart.

With that said, in a production environment, I think it's better to be explicit for this kind of change instead of trying to always create something that should match the chart default. Defaults will change during upgrades, and most of the time you'll only notice when it's too late.

This is only my opinion, but to me we should also explicitly set the autoscaler service account name so it always matches local.k8s_service_account_name.

What do you think?

@brannondorsey
Contributor

brannondorsey commented Nov 6, 2020

With that said, in a production environment, I think it's better to be explicit for this kind of change instead of trying to always create something that should match the chart default. Defaults will change during upgrades, and most of the time you'll only notice when it's too late.

Yeah, that's a good point. If this relationship had been more specific to begin with, it might not have been such a mystery as to what was going wrong here. I've updated the PR to explicitly set the rbac.serviceAccount.name in cluster-autoscaler-chart-values.yaml and added a code comment about this relationship in commit 1e04a4a.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022