Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance #4630

Closed
wants to merge 1 commit into from

Conversation

mjlshen
Copy link
Contributor

@mjlshen mjlshen commented Nov 13, 2023

/kind bug

What this PR does / why we need it:
This commit handles an edge-case where an error occurs after EC2 instance creation and before the output of a successful EC2 instance creation is returned. This causes the AWSMachine controller to attempt to find the EC2 instance by a set of filters instead of being able to explicitly find an EC2 instance by ID. Currently, the logic for finding an EC2 instance is not guaranteed to find the correct instance because one of the filters is by the Name tag, which can be overwritten by the user in the v1beta2 API.

In essence, this PR reverts #3991 because one of the assumptions in the backing issue, #3989:

As far as I can see, that tag isn't used anywhere in the codebase for matching or looking up of subnets. It seem purely for display.

Was true for subnets, but is not true for EC2 instances.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4629

Special notes for your reviewer:
I'm not sure if this is the right way to solve this problem, two other possibilities I thought of are:

  • Allowing users to override the Name tag for subnets, but not for EC2 instances
  • Making findInstances more idempotent when overriding the Name tag
    input := &ec2.DescribeInstancesInput{
    Filters: []*ec2.Filter{
    filter.EC2.VPC(s.scope.VPC().ID),
    filter.EC2.ClusterOwned(s.scope.Name()),
    filter.EC2.Name(scope.Name()),
    filter.EC2.InstanceStates(ec2.InstanceStateNamePending, ec2.InstanceStateNameRunning),
    },
    }
    this could be done by adding a new/unique CAPA-specific "name" tag so that findInstances can filter on that new tag instead of a Name tag that a user can overwrite, resulting in a unique EC2 instance match.

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

Disallow users from setting the `Name` tag on created AWS resources

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 13, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fabriziopandini for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 13, 2023
@enxebre
Copy link
Member

enxebre commented Nov 13, 2023

Thanks @mjlshen! This is awesome but I don't think it handles all the edge cases where the providerID might still not be reliable. E.g. aws API eventual consistency [1], process crash right after the API call... So the tags fallback must be trustable. The name of the replicas under a scalable resource is an implementation detail and letting it being overridden is a bug. I think we should consider reverting the original change that enabled it to be overridden, or if there's still justification for the network use case which drove that change, then we need to differentiate so this is not allowed for instances.

[1] https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. labels Nov 14, 2023
@mjlshen mjlshen changed the title Ensure providerID and instanceID get set immediately after creating an EC2 instance Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance Nov 14, 2023
This reverts commit e0dbf3e.

This revert allows the findInstance function to reliably find the
corresponding EC2 instance for a given AWSMachine again. Previously,
when we allowed users to specify a Name tag, findInstance would not be
able to uniquely guarantee that it could find an EC2 instance since all
created EC2 instances would have the same Name tag.
@mjlshen
Copy link
Contributor Author

mjlshen commented Nov 14, 2023

@AndiDog my PR is essentially reverting your PR for #3989 - do you have a use-case for setting the Name tag on subnets?

@AndiDog
Copy link
Contributor

AndiDog commented Nov 15, 2023

Users should be able to set the Name tag since like all non AWS-internal tags, it has no specific meaning apart from display purposes (e.g. AWS Console). That goes for EC2 instances, subnets and probably most other AWS resources.

If CAPA cannot find the instance correctly because it is tagged with {AWSCluster,AWSMachine}.spec.additionalTags["Name"], as expected, instead of scope.Name() (which currently returns AWSMachine.Name), then that's the bug and the find-by-name functionality needs to be fixed instead of forbidding overrides of the value. I think the tagging code is in func (r *AWSMachineReconciler) ensureTags, but I didn't read this complex function and assume it uses the additional tags map to set the Name, and does not make use of scope.Name(). If that's correct, then filter.EC2.Name(scope.Name()), is wrong and needs fixing plus a unit test to cover the case where the name tag is overridden.

@mjlshen
Copy link
Contributor Author

mjlshen commented Nov 15, 2023

Ah the Name tag is getting set correctly on EC2 instances, but it's the same for every EC2 instance, so findInstances can no longer uniquely find an instance in certain circumstances.

I think this means we need to introduce a new tag to uniquely link an EC2 instance to an AWS machine CR

@AndiDog
Copy link
Contributor

AndiDog commented Nov 15, 2023

True, that's definitely a problem since we only reconcile a single object and can't reason about others (such as "3 AWSMachine objects with same name tag").

Can you point me to the code which sets the EC2 instance name if no AdditionalTags["name"] is given?

@mjlshen
Copy link
Contributor Author

mjlshen commented Nov 19, 2023

Sorry for the delayed response - yes!

// Make sure to use the MachineScope here to get the merger of AWSCluster and AWSMachine tags
additionalTags := scope.AdditionalTags()
input.Tags = infrav1.Build(infrav1.BuildParams{
ClusterName: s.scope.KubernetesClusterName(),
Lifecycle: infrav1.ResourceLifecycleOwned,
Name: aws.String(scope.Name()),
Role: aws.String(scope.Role()),
Additional: additionalTags,
}.WithCloudProvider(s.scope.KubernetesClusterName()).WithMachineName(scope.Machine))
sets the EC2 instance name on creation

Where infrav1.Build() is the same Build function to assemble tag precedence:

// Build builds tags including the cluster tag and returns them in map form.
func Build(params BuildParams) Tags {
tags := make(Tags)
// Add the name tag first so that it can be overwritten by a user-provided tag in the `Additional` tags.
if params.Name != nil {
tags["Name"] = *params.Name
}
for k, v := range params.Additional {
tags[k] = v
}
if params.ClusterName != "" {
tags[ClusterTagKey(params.ClusterName)] = string(params.Lifecycle)
}
if params.Role != nil {
tags[NameAWSClusterAPIRole] = *params.Role
}
return tags
}
(where an additionally specified Name tag now has precedence over the AWSMachine .metadata.name in v1beta2 now)

@richardcase
Copy link
Member

It would be good to get @AverageMarcus input on this as well.

@AverageMarcus
Copy link
Member

AverageMarcus commented Nov 20, 2023

It's been quite a while since I've been deeply involved in CAPA code so I'm gonna keep this high level...

My general thought is that the users should be able to configure any of the tags they like, including Name, so that they can adhere to any possible tag policies they have in their organisation. If we're needing to use tags for any logic based decisions we should always be using CAPA-specific tag keys so instead of using the Name tag we would instead use something like sigs.k8s.io/cluster-api-provider-aws/MachineName.

Some background on the initial introduction of the configurable Name tag:

The driving force behind the initial change for us (Giant Swarm) is we wanted to allow for some complex network configurations where different machine pools could have different subnets and wanted to have control over how those subnets were named so they'd be identifiable. I forget what exactly the problem was with them all having the same name but if I remember correctly it was related to configuring transit gateway attachments and not being able to pick the correct subnet.

@cnmcavoy
Copy link
Contributor

Just in case my comment got lost on the bug report: #4629 (comment)

I agree with @AverageMarcus

@mjlshen
Copy link
Contributor Author

mjlshen commented Nov 20, 2023

Yeah, a new tag makes sense to me too, given that we want to be able to override a Name tag. I will update this PR to go in a new direction and introduce a new CAPA-specific tag based on AWSMachine name, thanks for all the feedback!

@mjlshen mjlshen changed the title Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance WIP: Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance Dec 11, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 11, 2023
@k8s-ci-robot
Copy link
Contributor

@mjlshen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-build-docker a93e910 link true /test pull-cluster-api-provider-aws-build-docker

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 22, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWSMachine controller leaks EC2 instances when specifying a Name tag in additionalTags
8 participants