WIP: Do not allow overriding the `Name` tag so that the AWSMachine controller can reliably find its EC2 instance #4630

mjlshen · 2023-11-13T03:32:59Z

/kind bug

What this PR does / why we need it:
This commit handles an edge-case where an error occurs after EC2 instance creation and before the output of a successful EC2 instance creation is returned. This causes the AWSMachine controller to attempt to find the EC2 instance by a set of filters instead of being able to explicitly find an EC2 instance by ID. Currently, the logic for finding an EC2 instance is not guaranteed to find the correct instance because one of the filters is by the Name tag, which can be overwritten by the user in the v1beta2 API.

In essence, this PR reverts #3991 because one of the assumptions in the backing issue, #3989:

As far as I can see, that tag isn't used anywhere in the codebase for matching or looking up of subnets. It seem purely for display.

Was true for subnets, but is not true for EC2 instances.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4629

Special notes for your reviewer:
I'm not sure if this is the right way to solve this problem, two other possibilities I thought of are:

Allowing users to override the Name tag for subnets, but not for EC2 instances

Making findInstances more idempotent when overriding the Name tag

cluster-api-provider-aws/pkg/cloud/services/ec2/instances.go

Lines 46 to 53 in abd444c

    
           input := &ec2.DescribeInstancesInput{ 
        
           	Filters: []*ec2.Filter{ 
        
           		filter.EC2.VPC(s.scope.VPC().ID), 
        
           		filter.EC2.ClusterOwned(s.scope.Name()), 
        
           		filter.EC2.Name(scope.Name()), 
        
           		filter.EC2.InstanceStates(ec2.InstanceStateNamePending, ec2.InstanceStateNameRunning), 
        
           	}, 
        
           }

this could be done by adding a new/unique CAPA-specific "name" tag so that findInstances can filter on that new tag instead of a Name tag that a user can overwrite, resulting in a unique EC2 instance match.

Checklist:

squashed commits
includes documentation
adds unit tests
adds or updates e2e tests

Release note:

Disallow users from setting the `Name` tag on created AWS resources

k8s-ci-robot · 2023-11-13T03:33:08Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fabriziopandini for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

enxebre · 2023-11-13T08:51:29Z

Thanks @mjlshen! This is awesome but I don't think it handles all the edge cases where the providerID might still not be reliable. E.g. aws API eventual consistency [1], process crash right after the API call... So the tags fallback must be trustable. The name of the replicas under a scalable resource is an implementation detail and letting it being overridden is a bug. I think we should consider reverting the original change that enabled it to be overridden, or if there's still justification for the network use case which drove that change, then we need to differentiate so this is not allowed for instances.

[1] https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency

This reverts commit e0dbf3e. This revert allows the findInstance function to reliably find the corresponding EC2 instance for a given AWSMachine again. Previously, when we allowed users to specify a Name tag, findInstance would not be able to uniquely guarantee that it could find an EC2 instance since all created EC2 instances would have the same Name tag.

mjlshen · 2023-11-14T03:57:37Z

@AndiDog my PR is essentially reverting your PR for #3989 - do you have a use-case for setting the Name tag on subnets?

AndiDog · 2023-11-15T08:39:19Z

Users should be able to set the Name tag since like all non AWS-internal tags, it has no specific meaning apart from display purposes (e.g. AWS Console). That goes for EC2 instances, subnets and probably most other AWS resources.

If CAPA cannot find the instance correctly because it is tagged with {AWSCluster,AWSMachine}.spec.additionalTags["Name"], as expected, instead of scope.Name() (which currently returns AWSMachine.Name), then that's the bug and the find-by-name functionality needs to be fixed instead of forbidding overrides of the value. I think the tagging code is in func (r *AWSMachineReconciler) ensureTags, but I didn't read this complex function and assume it uses the additional tags map to set the Name, and does not make use of scope.Name(). If that's correct, then filter.EC2.Name(scope.Name()), is wrong and needs fixing plus a unit test to cover the case where the name tag is overridden.

mjlshen · 2023-11-15T15:23:39Z

Ah the Name tag is getting set correctly on EC2 instances, but it's the same for every EC2 instance, so findInstances can no longer uniquely find an instance in certain circumstances.

I think this means we need to introduce a new tag to uniquely link an EC2 instance to an AWS machine CR

AndiDog · 2023-11-15T18:11:11Z

True, that's definitely a problem since we only reconcile a single object and can't reason about others (such as "3 AWSMachine objects with same name tag").

Can you point me to the code which sets the EC2 instance name if no AdditionalTags["name"] is given?

mjlshen · 2023-11-19T20:36:54Z

Sorry for the delayed response - yes!

cluster-api-provider-aws/pkg/cloud/services/ec2/instances.go

Lines 123 to 131 in b003e69

    
           // Make sure to use the MachineScope here to get the merger of AWSCluster and AWSMachine tags 
        
           additionalTags := scope.AdditionalTags() 
        
           input.Tags = infrav1.Build(infrav1.BuildParams{ 
        
           	ClusterName: s.scope.KubernetesClusterName(), 
        
           	Lifecycle:   infrav1.ResourceLifecycleOwned, 
        
           	Name:        aws.String(scope.Name()), 
        
           	Role:        aws.String(scope.Role()), 
        
           	Additional:  additionalTags, 
        
           }.WithCloudProvider(s.scope.KubernetesClusterName()).WithMachineName(scope.Machine))

sets the EC2 instance name on creation

Where infrav1.Build() is the same Build function to assemble tag precedence:

cluster-api-provider-aws/api/v1beta2/tags.go

Lines 243 to 264 in abd444c

    
           // Build builds tags including the cluster tag and returns them in map form. 
        
           func Build(params BuildParams) Tags { 
        
           	tags := make(Tags) 
        
           	// Add the name tag first so that it can be overwritten by a user-provided tag in the `Additional` tags. 
        
           	if params.Name != nil { 
        
           		tags["Name"] = *params.Name 
        
           	} 
        
           	for k, v := range params.Additional { 
        
           		tags[k] = v 
        
           	} 
        
           	if params.ClusterName != "" { 
        
           		tags[ClusterTagKey(params.ClusterName)] = string(params.Lifecycle) 
        
           	} 
        
           	if params.Role != nil { 
        
           		tags[NameAWSClusterAPIRole] = *params.Role 
        
           	} 
        
           	return tags 
        
           }

(where an additionally specified Name tag now has precedence over the AWSMachine .metadata.name in v1beta2 now)

richardcase · 2023-11-20T08:57:20Z

It would be good to get @AverageMarcus input on this as well.

AverageMarcus · 2023-11-20T09:16:30Z

It's been quite a while since I've been deeply involved in CAPA code so I'm gonna keep this high level...

My general thought is that the users should be able to configure any of the tags they like, including Name, so that they can adhere to any possible tag policies they have in their organisation. If we're needing to use tags for any logic based decisions we should always be using CAPA-specific tag keys so instead of using the Name tag we would instead use something like sigs.k8s.io/cluster-api-provider-aws/MachineName.

Some background on the initial introduction of the configurable Name tag:

The driving force behind the initial change for us (Giant Swarm) is we wanted to allow for some complex network configurations where different machine pools could have different subnets and wanted to have control over how those subnets were named so they'd be identifiable. I forget what exactly the problem was with them all having the same name but if I remember correctly it was related to configuring transit gateway attachments and not being able to pick the correct subnet.

cnmcavoy · 2023-11-20T19:10:34Z

Just in case my comment got lost on the bug report: #4629 (comment)

I agree with @AverageMarcus

mjlshen · 2023-11-20T20:26:33Z

Yeah, a new tag makes sense to me too, given that we want to be able to override a Name tag. I will update this PR to go in a new direction and introduce a new CAPA-specific tag based on AWSMachine name, thanks for all the feedback!

k8s-ci-robot · 2024-02-29T12:21:50Z

@mjlshen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cluster-api-provider-aws-build-docker	`a93e910`	link	true	`/test pull-cluster-api-provider-aws-build-docker`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot · 2024-04-24T05:45:31Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2024-07-23T06:05:52Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-08-22T06:51:33Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-09-21T07:51:21Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-09-21T07:51:25Z

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen

Mark this PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 13, 2023

k8s-ci-robot requested review from Ankitasw and AverageMarcus November 13, 2023 03:33

k8s-ci-robot added the needs-priority label Nov 13, 2023

k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 13, 2023

mjlshen force-pushed the 4629 branch from 04d4428 to 3d5b44d Compare November 14, 2023 03:47

mjlshen changed the title ~~Ensure providerID and instanceID get set immediately after creating an EC2 instance~~ Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance Nov 14, 2023

mjlshen force-pushed the 4629 branch from 3d5b44d to a93e910 Compare November 14, 2023 03:54

mjlshen mentioned this pull request Nov 30, 2023

🐛 Update awsmachine providerID and instanceID immediately after ec2:RunInstances is called #4670

Merged

5 tasks

mjlshen changed the title ~~Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance~~ WIP: Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance Dec 11, 2023

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 11, 2023

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 22, 2024

k8s-ci-robot closed this Sep 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Do not allow overriding the `Name` tag so that the AWSMachine controller can reliably find its EC2 instance #4630

WIP: Do not allow overriding the `Name` tag so that the AWSMachine controller can reliably find its EC2 instance #4630

mjlshen commented Nov 13, 2023 •

edited

Loading

k8s-ci-robot commented Nov 13, 2023

enxebre commented Nov 13, 2023 •

edited

Loading

mjlshen commented Nov 14, 2023

AndiDog commented Nov 15, 2023

mjlshen commented Nov 15, 2023

AndiDog commented Nov 15, 2023

mjlshen commented Nov 19, 2023

richardcase commented Nov 20, 2023

AverageMarcus commented Nov 20, 2023 •

edited

Loading

cnmcavoy commented Nov 20, 2023

mjlshen commented Nov 20, 2023

k8s-ci-robot commented Feb 29, 2024

k8s-ci-robot commented Apr 24, 2024

k8s-triage-robot commented Jul 23, 2024

k8s-triage-robot commented Aug 22, 2024

k8s-triage-robot commented Sep 21, 2024

k8s-ci-robot commented Sep 21, 2024

	input := &ec2.DescribeInstancesInput{
	Filters: []*ec2.Filter{
	filter.EC2.VPC(s.scope.VPC().ID),
	filter.EC2.ClusterOwned(s.scope.Name()),
	filter.EC2.Name(scope.Name()),
	filter.EC2.InstanceStates(ec2.InstanceStateNamePending, ec2.InstanceStateNameRunning),
	},
	}

WIP: Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance #4630

WIP: Do not allow overriding the Name tag so that the AWSMachine controller can reliably find its EC2 instance #4630

Conversation

mjlshen commented Nov 13, 2023 • edited Loading

k8s-ci-robot commented Nov 13, 2023

enxebre commented Nov 13, 2023 • edited Loading

mjlshen commented Nov 14, 2023

AndiDog commented Nov 15, 2023

mjlshen commented Nov 15, 2023

AndiDog commented Nov 15, 2023

mjlshen commented Nov 19, 2023

richardcase commented Nov 20, 2023

AverageMarcus commented Nov 20, 2023 • edited Loading

cnmcavoy commented Nov 20, 2023

mjlshen commented Nov 20, 2023

k8s-ci-robot commented Feb 29, 2024

k8s-ci-robot commented Apr 24, 2024

k8s-triage-robot commented Jul 23, 2024

k8s-triage-robot commented Aug 22, 2024

k8s-triage-robot commented Sep 21, 2024

k8s-ci-robot commented Sep 21, 2024

WIP: Do not allow overriding the `Name` tag so that the AWSMachine controller can reliably find its EC2 instance #4630

WIP: Do not allow overriding the `Name` tag so that the AWSMachine controller can reliably find its EC2 instance #4630

mjlshen commented Nov 13, 2023 •

edited

Loading

enxebre commented Nov 13, 2023 •

edited

Loading

AverageMarcus commented Nov 20, 2023 •

edited

Loading