Allow garbage collector to delete ec2 instances #4568

vincepri · 2023-10-10T23:25:59Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

An addition to the experimental GC by @richardcase. We should also see if we can scan the CAPA-owned tags to make sure we don't leave anything behind. Thoughts?

Release note:

NONE

k8s-ci-robot · 2023-10-10T23:26:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Vince Prignano <[email protected]>

vincepri · 2023-10-11T02:10:00Z

/retest

vincepri · 2023-10-11T03:10:42Z

/test ?

k8s-ci-robot · 2023-10-11T03:10:44Z

@vincepri: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-aws-apidiff-main
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks
/test pull-cluster-api-provider-aws-e2e-eks-gc
/test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-aws-apidiff-main
pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vincepri · 2023-10-11T03:11:07Z

/test pull-cluster-api-provider-aws-e2e-eks-gc

vincepri · 2023-10-11T13:57:57Z

Actually, it seems these tests have been failing for quite some time: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-aws#pr-e2e-eks-gc-main

vincepri · 2023-10-11T13:58:27Z

FWIW, I've tested the logic in a custom cluster by creating an EC2 instance outside of CAPA, and it was deleted properly.

richardcase · 2023-10-11T21:23:10Z

@vincepri - I'm curious in what scenarios an EC2 instance would be created that isn't managed by CAPA itself (i.e. via Machines/MachinePools)?

The original idea of the GC was to clean up resources that were created by the CCM, as these can block CAPA from deleting a cluster. So, things like an application being deployed that has a Service of type LoadBalancer.

vincepri · 2023-10-11T21:36:13Z

@richardcase A few use cases I had in mind:

This would allow cleaning up cluster leftovers, especially if we expand the filters to include CAPA-owned tags.
It allows better support for OpenShift, which creates resources using an old version of Cluster API that tags resources with the cloud provider.

JoelSpeed · 2023-10-12T16:02:41Z

We've certainly seen users on AWS in the past create EC2 instances by themselves and join them to OpenShift clusters when the tooling within Kube/OpenShift didn't support a feature that they needed. I expect we aren't the only people seeing users do that.

Imagine before we supported say EFA networking, if a user wanted to use that, what would stop them building and adding their own EC2 instances to the workload cluster? I don't think anything would stop them, and we should expect that this has happened somewhere before, where we have feature gaps

vincepri · 2023-10-16T16:38:46Z

/milestone v2.3.0

k8s-ci-robot · 2023-10-16T16:38:47Z

@vincepri: You must be a member of the kubernetes-sigs/cluster-api-provider-aws-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider AWS Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v2.3.0

dlipovetsky · 2023-10-16T17:15:36Z

Imagine before we supported say EFA networking, if a user wanted to use that, what would stop them building and adding their own EC2 instances to the workload cluster?

If the user created the EC2 instances, I think we can argue that the user is responsible for deleting them.

However, I think we can argue that the purpose of the garbage collector is to unblock cluster deletion. If a user creates an EC2 instance, CAPA does not delete it in its ordinary reconciliation, so the instance blocks deletion of the subnet, VPC, etc., and therefore blocks cluster deletion.

Therefore, we could say that the garbage collector should be extended to clean up all AWS resources that would block cluster deletion. I think the garbage collector must be "best effort," because there are edge cases. For example, some AWS resources may be missing the correct tags, and removing others may require different AWS credentials than the ones the garbage collector has.

vincepri · 2023-10-17T02:04:54Z

/milestone v2.3.0

k8s-ci-robot · 2023-10-17T02:04:55Z

@vincepri: You must be a member of the kubernetes-sigs/cluster-api-provider-aws-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider AWS Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v2.3.0

vincepri · 2023-10-23T17:36:15Z

@richardcase Is this good to go?

vincepri · 2023-10-23T18:35:27Z

/test pull-cluster-api-provider-aws-e2e-eks-gc

richardcase · 2023-10-26T10:45:02Z

Just testing it's not a flake:

/test pull-cluster-api-provider-aws-e2e-eks-gc

(It could be that #4575 will be needed... I get the failing test fixed on that PR.)

vincepri · 2023-11-13T16:43:14Z

/test pull-cluster-api-provider-aws-e2e-eks-gc

AndiDog

Looks fine, but are there tests covering this case already? Else we need to add tests.

richardcase · 2023-11-20T08:48:55Z

Circling back to this now that I finally have some time. After doing another review, I think we probably need:

Unit tests covering this new deletion (thanks @AndiDog).
An alternative collection function that will describe/get the EC2 instances like we do for the other resource types. See this. Some partitions do not allow the use of the resource tagging API, so a fallback (i.e. an alternative) is required.
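As a rough illustration of what such a fallback collection function could look like, here is a self-contained Go sketch. It does not use the AWS SDK: `instance`, `describeFunc`, and `collectOwnedInstances` are hypothetical stand-ins, and the tag key format `kubernetes.io/cluster/<name>` with value `owned` is an assumption about the cloud-provider tag rather than something stated in this thread.

```go
package main

import "fmt"

// instance is a hypothetical, simplified view of an EC2 instance.
type instance struct {
	ID    string
	State string
	Tags  map[string]string
}

// describeFunc stands in for a describe/get call against EC2 (hypothetical).
type describeFunc func() ([]instance, error)

// ownedTagKey builds the ownership tag key for a cluster.
// Assumption: the key has the form kubernetes.io/cluster/<name>.
func ownedTagKey(clusterName string) string {
	return "kubernetes.io/cluster/" + clusterName
}

// collectOwnedInstances lists instances via the describe call and keeps the
// IDs of those that carry the ownership tag and are not already terminated.
func collectOwnedInstances(clusterName string, describe describeFunc) ([]string, error) {
	instances, err := describe()
	if err != nil {
		return nil, fmt.Errorf("describing EC2 instances for cluster %s: %w", clusterName, err)
	}
	key := ownedTagKey(clusterName)
	var ids []string
	for _, inst := range instances {
		if inst.Tags[key] != "owned" || inst.State == "terminated" {
			continue
		}
		ids = append(ids, inst.ID)
	}
	return ids, nil
}

func main() {
	describe := func() ([]instance, error) {
		return []instance{
			{ID: "i-aaa", State: "running", Tags: map[string]string{"kubernetes.io/cluster/demo": "owned"}},
			{ID: "i-bbb", State: "running", Tags: map[string]string{"kubernetes.io/cluster/other": "owned"}},
			{ID: "i-ccc", State: "terminated", Tags: map[string]string{"kubernetes.io/cluster/demo": "owned"}},
		}, nil
	}
	ids, err := collectOwnedInstances("demo", describe)
	fmt.Println(ids, err)
}
```

Keeping the describe call behind a function type also makes the unit tests requested above straightforward, since a test can pass a stub instead of talking to AWS.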

AndiDog · 2023-12-21T09:13:28Z

pkg/cloud/services/gc/ec2.go

+
+		instanceID := strings.ReplaceAll(resource.ARN.Resource, "instance/", "")
+		if err := s.deleteEC2Instance(ctx, instanceID); err != nil {
+			return fmt.Errorf("deleting EC2 instance %s: %w", instanceID, err)


Sorry, I have a new request here because I stumbled over non-actionable error messages in existing GC code (they were missing the region and humans shouldn't be required to loop through all accounts and regions to find an object where only the ID is logged). I'm fixing existing code in a new PR.

Suggested change

return fmt.Errorf("deleting EC2 instance %s: %w", instanceID, err)

return fmt.Errorf("deleting EC2 instance %s with ID %s: %w", resource.ARN, instanceID, err)

richardcase · 2024-01-19T08:21:36Z

/milestone v2.4.0

k8s-ci-robot · 2024-02-29T12:21:50Z

@vincepri: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-aws-build-docker | `7c2d0b5` | link | true | `/test pull-cluster-api-provider-aws-build-docker` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot added labels Oct 10, 2023: release-note-none (denotes a PR that doesn't merit a release note), kind/feature (categorizes issue or PR as related to a new feature), cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)

k8s-ci-robot requested review from AverageMarcus and shivi28 October 10, 2023 23:26

k8s-ci-robot added labels Oct 10, 2023: needs-priority, size/M (denotes a PR that changes 30-99 lines, ignoring generated files)

Allow garbage collector to delete ec2 instances

7c2d0b5

Signed-off-by: Vince Prignano <[email protected]>

vincepri force-pushed the ec2-instance-gc branch from 2eb5d82 to 7c2d0b5 October 10, 2023 23:30

richardcase added this to the v2.3.0 milestone Oct 16, 2023

AndiDog reviewed Nov 16, 2023

AndiDog reviewed Dec 21, 2023

k8s-ci-robot modified the milestones: v2.3.0, v2.4.0 Jan 19, 2024

richardcase removed this from the v2.4.0 milestone Jan 23, 2024

vincepri closed this May 6, 2024
