Add IRSA Support to Self Managed Clusters #4094

luthermonson · 2023-02-22T19:23:03Z

What type of PR is this?
/kind feature

What this PR does / why we need it: Adds IRSA (oidc provider/pod-identity-webhook) functionality to an IAAS cluster. This will bring the IAAS clusters inline in functionality with Managed clusters which provide most of this functionality out of the box.

How this was Accomplished:
In CAPA kubeadm already builds in serving certs and since kube 1.21 there are two endpoints in the apiserver which make it an oidc provider, these are /.well-known/openid-configuration and /openid/v1/jwks. One can configure the openid responses through a couple kube apiserver parameters which is managed and ran as a static pod. To accomplish any modifications the preferred method to add additional extra args is actually through kubeadm patches as per a trail of issues which show an api design flaw in the map[string]string used for ExtraArgs.

Once the apiserver end points are running they need to be served publicaly on port 443, since CAPA already uses an s3 bucket for ignition configs I repurposed it to host the contents of the two files and configured the API server accordingly. Then all that has to be done is an oidc provider is created in IAM which points to the S3 bucket and we install the pod-identity-webhook provided by AWS.

The EKS implementation only creates a configmap called boilerplate-oidc-trust-policy. You are expected to copy the contents of this configmap, edit in your details for the name/namespace of the service account and create a new role with that trust policy while assigning additional roles you with the account to use. There is an example I use to test IRSA which a role will need to be created and given both the trust policy and assigned addtional roles and it will list your s3 buckets confirming the oidc provider and IRSA are functioning.

Extra details to look for to understand how this actually works, the pod-identity-webhook adds a mutating webhook for all pod creation. If the pod has a service account and it contains an annotation like eks.amazonaws.com/role-arn: arn:aws:iam::<accountID>:role/test it will know to generate an STS token and inject it directly into the pod via Env variables.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3560

Special notes for your reviewer:
Adding a new ReconcileOIDCProvider to the AWSCluster reconciliation loop.

Created a new IAM service, in the future the same logic for EKS could be combined as previously the oidc code was buried in the EKS service. Details on the reconciler can be found in comments.
Updated the S3 service to take keys instead of Machines and added a CreatePublic for the JWKS/OIDC config
Exposed a ManagementClient and RemoteClient for both cluster types and exported Client.
Moved OIDCProvider status type to v1beta2 and migrated out of the EKS API to make one type both clusters can reference a single type.
This cyclomatic showed up during testing so I refactored reconcileNormal

controllers/awsmachine_controller.go:435:1: cyclomatic complexity 38 of func `(*AWSMachineReconciler).reconcileNormal` is high (> 30) (gocyclo)
func (r *AWSMachineReconciler) reconcileNormal(ctx context.Context, machineScope *scope.MachineScope, clusterScope cloud.ClusterScoper, ec2Scope scope.EC2Scope, elbScope scope.ELBScope, objectStoreScope scope.S3Scope) (ctrl.Result, error) {

Checklist:

squashed commits
includes documentation
adds unit tests
adds or updates e2e tests

Release note:

adding IRSA support for self-hosted clusters

Skarlso · 2023-02-23T06:23:16Z

/assign

k8s-ci-robot · 2023-10-14T18:57:26Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

luthermonson · 2023-10-22T16:34:14Z

@davidspek I might take this up and try finishing it if @luthermonson is out.

i'm not quite dead yet

pavansokkenagaraj · 2023-11-23T16:23:03Z

Hi @luthermonson, Please correct me if I am wrong,

If the pod-identity-webhook is situated on a worker node, and that node experiences downtime or is rescheduled to another node, the subsequent rescheduling of new pods on the affected node results in both the pods and the pod-identity-webhook being scheduled simultaneously.

This simultaneous rescheduling disrupts the injection of values from the pod-identity-webhook into the pods, leading to an absence of the expected injected values.

To address this issue, several potential solutions can be considered:

Schedule pod-identity-webhook in the control plane:
Ensure that the pod-identity-webhook is scheduled within the control plane, preventing disruptions caused by worker node rescheduling.
Implement a PodDistributionBudget with min.available set to 1:
Utilize a PodDistributionBudget to guarantee a minimum number of available pods (set to 1) even during rescheduling scenarios, maintaining continuous operation.
Establish a PriorityClass:
Employ a PriorityClass to assign priority to the pod-identity-webhook, influencing scheduling decisions and reducing the likelihood of simultaneous rescheduling.
Configure a MutatingWebhookConfiguration with failurePolicy set to fail (or make it optional):
Define a MutatingWebhookConfiguration with a failurePolicy set to "fail" or consider making it optional. This helps handle errors more explicitly during webhook execution.

Additionally, it's crucial to address the lack of updates on pod-identity-webhook resources to prevent upgrade issues when introducing new features during capi-aws upgrades. Regularly update the pod-identity-webhook resources to ensure compatibility with the latest capabilities and avoid potential complications during upgrades.

Example: Updating the pod-identity-webhook image will not be reconciled

richardcase · 2023-11-25T08:04:28Z

@luthermonson - feel free to ping me if i can help get this in.

pavansokkenagaraj · 2023-12-04T13:44:05Z

pkg/cloud/services/iam/reconcilers.go

+
+// reconcilePodIdentityWebhook generates certs and starts the webhook in the workload cluster
+// https://github.com/aws/amazon-eks-pod-identity-webhook
+// 1. generate webhook certs via cert-manager in the management cluster


@luthermonson
What is the reasoning behind creating certs resources in the management cluster?

Security. You don't want unencrypted traffic to an HTTP webhook endpoint.

ehn, i think i know what he's asking... @pavansokkenagaraj capa does not automatically install cert-manager into the downstream cluster. for capi projects the management cluster gets cert-manager automatically and is the only place we can generate them which is why it's done there and moved

richardcase · 2023-12-13T18:13:37Z

@luthermonson - sorry i need to get back to this.

k8s-ci-robot · 2024-02-29T12:22:07Z

@luthermonson: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cluster-api-provider-aws-build-docker	`ea2869e`	link	true	`/test pull-cluster-api-provider-aws-build-docker`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

joshfrench · 2024-04-23T13:47:51Z

Is there anything an interested observer can do to help get this over the finish line?

sl1pm4t · 2024-07-22T10:35:32Z

While testing this today, I found the CAPA controller needed these additional permissions:

- apiGroups:
  - cert-manager.io
  resources:
  - certificates
  - issuers
  verbs:
  - create
  - get
  - list
  - delete
- apiGroups:
    - controlplane.cluster.x-k8s.io
  resources:
    - kubeadmcontrolplanes
  verbs:
    - get
    - list
    - patch
    - update

sl1pm4t · 2024-07-23T01:59:14Z

Also while testing this PR, and upon trying to delete the cluster I found the deletion was getting blocked because CAPA controller was throwing error if the OIDC provider didn't exist / had already been deleted, so I added this to gracefully handle the NoSuchEntity error condition from AWS API:

sl1pm4t@5956c87

(copying the error handling pattern seen in other services)

richardcase · 2024-07-23T06:48:49Z

It will be a push to get this in but lets add this so we are tracking it:

/milestone v2.6.0

sl1pm4t · 2024-07-23T07:09:06Z

pkg/cloud/services/iam/reconcilers.go

+		return err
+	}
+
+	if err := reconcilePodIdentityWebhookComponents(ctx, s.scope.Namespace(), certSecret, remoteClient); err != nil {


The use of s.scope.Namespace() here means that CAPA will attempt to install the pod-identity-webhook into a remote namespace matching whatever namespace the cluster resources were created in on the management cluster

This works for the simplest case of installing everything into default, but if the cluster resources are created in something else like a namespace called capi-<cluster-name> then this will fail, because the namespace does not exist in the remote cluster.

IMO - the pod-identity-webhook should be deployed into the kube-system namespace.

I've pushed a commit to my fork here that:

hard codes the installation namespace of pod-identity-webhook to be kube-system

fixes the certificates to reflect the new namespace

fixes an issue that could prevent a successful reconcile if the webhook certificate secret already existed in the remote cluster (from a previous reconcile loop).

bumps the pod-identity-webhook image version to the latest (v0.5.5)

this was based on the install examples of the podid wh... https://github.com/aws/amazon-eks-pod-identity-webhook/blob/master/deploy/deployment-base.yaml#L5 trying to keep the install as similar to upstream as possible which does install it into default. the rig I used to work on made that namespace for you in the workload cluster (we created and pivoted and put everything there) so I never caught that bug,

I'm ok with hardcoding to kube-system (which is probably the best place for it) but just like to make a case for default being where the AWS/EKS devs intended self-install users to put it

I've always thought it was strange their examples deployed to default. From a permissions point of view, it's easy to protect access to it in kube-system but often users would have permission in default.

I note that kOps also hard codes it to kube-system: https://github.com/kubernetes/kops/blob/41480d95a96edad247ecd777b2790ecda33f6a85/upup/models/cloudup/resources/addons/eks-pod-identity-webhook.addons.k8s.io/k8s-1.16.yaml.template#L69

richardcase · 2024-07-29T08:45:04Z

Pushing back:

/milestone v2.7.0

damdo · 2024-10-15T15:35:09Z

@luthermonson this will need rebasing

sl1pm4t · 2024-10-15T19:26:23Z

@damdo - with agreement from @luthermonson I have continued the work on this PR (including rebase) over in #5109

damdo · 2024-10-17T08:23:52Z

Thanks for the heads up @sl1pm4t
I suppose we can close this PR in favor of #5109 then @luthermonson ?

richardcase · 2024-10-21T19:30:45Z

As #5109 is the new PR:

/close

k8s-ci-robot · 2024-10-21T19:30:49Z

@richardcase: Closed this PR.

In response to this:

As #5109 is the new PR:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot requested review from Ankitasw and dlipovetsky February 22, 2023 19:25

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 22, 2023

luthermonson force-pushed the awscluster-irsa branch 8 times, most recently from 9e499a4 to 62fe36e Compare February 23, 2023 06:06

k8s-ci-robot assigned Skarlso Feb 23, 2023

luthermonson force-pushed the awscluster-irsa branch 12 times, most recently from 4ff662e to 44f63a6 Compare February 23, 2023 20:43

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 14, 2023

Skarlso force-pushed the awscluster-irsa branch 2 times, most recently from eafcd38 to d9ac6f0 Compare October 16, 2023 13:02

luthermonson and others added 2 commits October 24, 2023 08:19

adding IRSA functionality through an OIDC provider using an s3 bucket

29896f2

applied a small number of refactorings as I read through the code

ea2869e

Skarlso force-pushed the awscluster-irsa branch from d9ac6f0 to ea2869e Compare October 24, 2023 06:19

pavansokkenagaraj reviewed Dec 4, 2023

View reviewed changes

k8s-ci-robot added this to the v2.6.0 milestone Jul 23, 2024

sl1pm4t reviewed Jul 23, 2024

View reviewed changes

k8s-ci-robot modified the milestones: v2.6.0, v2.7.0 Jul 29, 2024

sl1pm4t mentioned this pull request Aug 21, 2024

✨ feat: Add IRSA Support for Self Managed Clusters (2024 edition) #5101

Closed

5 tasks

sl1pm4t mentioned this pull request Aug 28, 2024

✨ feat: Add IRSA support for self-managed clusters (rebase) #5109

Open

5 tasks

k8s-ci-robot closed this Oct 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add IRSA Support to Self Managed Clusters #4094

Add IRSA Support to Self Managed Clusters #4094

luthermonson commented Feb 22, 2023 •

edited

Loading

Skarlso commented Feb 23, 2023

k8s-ci-robot commented Oct 14, 2023

luthermonson commented Oct 22, 2023

pavansokkenagaraj commented Nov 23, 2023 •

edited

Loading

richardcase commented Nov 25, 2023

pavansokkenagaraj Dec 4, 2023

Skarlso Dec 4, 2023

luthermonson Dec 4, 2023

richardcase commented Dec 13, 2023

k8s-ci-robot commented Feb 29, 2024

joshfrench commented Apr 23, 2024

sl1pm4t commented Jul 22, 2024

sl1pm4t commented Jul 23, 2024 •

edited

Loading

richardcase commented Jul 23, 2024

sl1pm4t Jul 23, 2024 •

edited

Loading

sl1pm4t Jul 23, 2024 •

edited

Loading

luthermonson Jul 23, 2024

sl1pm4t Jul 23, 2024 •

edited

Loading

richardcase commented Jul 29, 2024

damdo commented Oct 15, 2024

sl1pm4t commented Oct 15, 2024

damdo commented Oct 17, 2024

richardcase commented Oct 21, 2024

k8s-ci-robot commented Oct 21, 2024

Add IRSA Support to Self Managed Clusters #4094

Add IRSA Support to Self Managed Clusters #4094

Conversation

luthermonson commented Feb 22, 2023 • edited Loading

Skarlso commented Feb 23, 2023

k8s-ci-robot commented Oct 14, 2023

luthermonson commented Oct 22, 2023

pavansokkenagaraj commented Nov 23, 2023 • edited Loading

richardcase commented Nov 25, 2023

pavansokkenagaraj Dec 4, 2023

Choose a reason for hiding this comment

Skarlso Dec 4, 2023

Choose a reason for hiding this comment

luthermonson Dec 4, 2023

Choose a reason for hiding this comment

richardcase commented Dec 13, 2023

k8s-ci-robot commented Feb 29, 2024

joshfrench commented Apr 23, 2024

sl1pm4t commented Jul 22, 2024

sl1pm4t commented Jul 23, 2024 • edited Loading

richardcase commented Jul 23, 2024

sl1pm4t Jul 23, 2024 • edited Loading

Choose a reason for hiding this comment

sl1pm4t Jul 23, 2024 • edited Loading

Choose a reason for hiding this comment

luthermonson Jul 23, 2024

Choose a reason for hiding this comment

sl1pm4t Jul 23, 2024 • edited Loading

Choose a reason for hiding this comment

richardcase commented Jul 29, 2024

damdo commented Oct 15, 2024

sl1pm4t commented Oct 15, 2024

damdo commented Oct 17, 2024

richardcase commented Oct 21, 2024

k8s-ci-robot commented Oct 21, 2024

luthermonson commented Feb 22, 2023 •

edited

Loading

pavansokkenagaraj commented Nov 23, 2023 •

edited

Loading

sl1pm4t commented Jul 23, 2024 •

edited

Loading

sl1pm4t Jul 23, 2024 •

edited

Loading

sl1pm4t Jul 23, 2024 •

edited

Loading

sl1pm4t Jul 23, 2024 •

edited

Loading