Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow garbage collector to delete ec2 instances #4568

Closed
wants to merge 1 commit into from

Conversation

vincepri
Copy link
Member

What type of PR is this?

What this PR does / why we need it:

/kind feature

Addition to the experimental gc by @richardcase, we should also see if we can scan the CAPA owned tags, to make sure we don't have any leftover. Thoughts?

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 10, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-priority size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 10, 2023
@vincepri
Copy link
Member Author

/retest

@vincepri
Copy link
Member Author

/test ?

@k8s-ci-robot
Copy link
Contributor

@vincepri: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vincepri
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e-eks-gc

@vincepri
Copy link
Member Author

Actually seems these tests have been failing for quite some time https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-aws#pr-e2e-eks-gc-main

@vincepri
Copy link
Member Author

FWIW, I've tested the logic in a custom cluster by creating an ec2 instance outside of CAPA, and it deleted properly

@richardcase
Copy link
Member

@vincepri - i'm curious in what scenarios there would be an EC2 instance created that isn't managed by CAPA itself (i.e. via Machines/MachinePools)?

The original idea of the GC was for cleaning up resources that where created by the CCM as this can block CAPA deleting a cluster. So, things like an application being deployed that has a service of type load balancer.

@vincepri
Copy link
Member Author

@richardcase A few use cases I had in mind:

  • This would allow to cleanup cluster leftovers; especially if we do expand the filters to include CAPA owned tags.
  • Allows to better support OpenShift, which creates resources using an old version of Cluster API, that tags resources with the cloud provider.

@JoelSpeed
Copy link

JoelSpeed commented Oct 12, 2023

We've certainly seen users on AWS in the past create EC2 instances by themselves and join them to OpenShift clusters when the tooling within Kube/OpenShift didn't support a feature that they needed. I expect we aren't the only people seeing users do that

Imagine before we supported say EFA networking, if a user wanted to use that, what would stop them building and adding their own EC2 instances to the workload cluster? I don't think anything would stop them, and we should expect that this has happened somewhere before, where we have feature gaps

@vincepri
Copy link
Member Author

/milestone v2.3.0

@k8s-ci-robot
Copy link
Contributor

@vincepri: You must be a member of the kubernetes-sigs/cluster-api-provider-aws-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider AWS Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v2.3.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@richardcase richardcase added this to the v2.3.0 milestone Oct 16, 2023
@dlipovetsky
Copy link
Contributor

Imagine before we supported say EFA networking, if a user wanted to use that, what would stop them building and adding their own EC2 instances to the workload cluster?

If the user created the EC2 instances, I think we can argue that the user is responsible for deleting them.

However, I think we can argue that the purpose of the garbage collector is to unblock cluster deletion. If a user creates an EC2 instance, CAPA does not delete it in its ordinary reconciliation, and the instance blocks deletion of the subnet, VPC, etc, and therefore blocks cluster deletion.

Therefore, we could say that the garbage collector should be extended to clean up all AWS resources that would block cluster deletion. I think the garbage collector must be "best effort," because there are edge cases. For example, some AWS resources may be missing the correct tags, and removing others may require different AWS credentials than the ones the garbage collector has.

@vincepri
Copy link
Member Author

/milestone v2.3.0

@k8s-ci-robot
Copy link
Contributor

@vincepri: You must be a member of the kubernetes-sigs/cluster-api-provider-aws-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Provider AWS Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v2.3.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vincepri
Copy link
Member Author

@richardcase Is this good to go?

@vincepri
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e-eks-gc

@richardcase
Copy link
Member

Just testing its not a flake:

/test pull-cluster-api-provider-aws-e2e-eks-gc

(could be that #4575 will be needed...i get the failing test fixed on that PR)

@vincepri
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e-eks-gc

Copy link
Contributor

@AndiDog AndiDog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine, but are there tests covering this case already? Else we need to add tests.

@richardcase
Copy link
Member

Circling back to this now i finally have some time. After doing another review i think we probably need:

  • Unit tests covering this new deletion (thanks @AndiDog )
  • An aletrnative collection function that will descibe/get the EC2 instances like we do for the other resource types.. See this. Some partitions do not allow the use of the resource tagging api and so a fallback (i.e. alternative) is required.


instanceID := strings.ReplaceAll(resource.ARN.Resource, "instance/", "")
if err := s.deleteEC2Instance(ctx, instanceID); err != nil {
return fmt.Errorf("deleting EC2 instance %s: %w", instanceID, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I have a new request here because I stumbled over non-actionable error messages in existing GC code (they were missing the region and humans shouldn't be required to loop through all accounts and regions to find an object where only the ID is logged). I'm fixing existing code in a new PR.

Suggested change
return fmt.Errorf("deleting EC2 instance %s: %w", instanceID, err)
return fmt.Errorf("deleting EC2 instance %s with ID %s: %w", resource.ARN, instanceID, err)

@richardcase
Copy link
Member

/milestone v2.4.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v2.3.0, v2.4.0 Jan 19, 2024
@richardcase richardcase removed this from the v2.4.0 milestone Jan 23, 2024
@k8s-ci-robot
Copy link
Contributor

@vincepri: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-build-docker 7c2d0b5 link true /test pull-cluster-api-provider-aws-build-docker

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vincepri vincepri closed this May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants