
AWSMachine controller leaks EC2 instances when specifying a Name tag in additionalTags #4629

Closed
mjlshen opened this issue Nov 10, 2023 · 10 comments
Labels
kind/bug · lifecycle/rotten · needs-priority · needs-triage

Comments

@mjlshen
Contributor

mjlshen commented Nov 10, 2023

/kind bug

What steps did you take and what happened:
After applying two infrastructure.cluster.x-k8s.io/v1beta2 AWSMachines with the following additionalTags:

spec:
  additionalTags:
    Name: sample-name-tag

124 EC2 instances were created in quick succession for these AWSMachines before .spec.providerID was populated. All of the EC2 instances (leaked and not leaked) do have the tag Name: sample-name-tag

❯ k get awsmachine                                                  
NAME                                    CLUSTER                            STATE     READY   INSTANCEID                              MACHINE
sample-name-tag-251d8688-cjd7r   ${ID}   running   true    aws:///us-east-2a/i-019c51ee0ded3e51a   sample-name-tag-workers-55b788fd4dx5n4zb-kn2h6
sample-name-tag-251d8688-hwhb4   ${ID}   running   true    aws:///us-east-2a/i-0c9169fb1781d2a8e   sample-name-tag-workers-55b788fd4dx5n4zb-8brbs

What did you expect to happen:
One EC2 instance to be created per AWSMachine, with the tag Name: sample-name-tag

Anything else you would like to add:
I think this occurs because:

  1. We try to find an instance during a normal reconcile loop:

    // Find existing instance
    instance, err := r.findInstance(machineScope, ec2svc)
    if err != nil {
        machineScope.Error(err, "unable to find instance")
        conditions.MarkUnknown(machineScope.AWSMachine, infrav1.InstanceReadyCondition, infrav1.InstanceNotFoundReason, err.Error())
        return ctrl.Result{}, err
    }
  2. We look for the instance with these tags, including the name of the AWSMachine:

    input := &ec2.DescribeInstancesInput{
        Filters: []*ec2.Filter{
            filter.EC2.VPC(s.scope.VPC().ID),
            filter.EC2.ClusterOwned(s.scope.Name()),
            filter.EC2.Name(scope.Name()),
            filter.EC2.InstanceStates(ec2.InstanceStateNamePending, ec2.InstanceStateNameRunning),
        },
    }
  3. If instance == nil, a new EC2 instance is created: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/56c9a39dd834640ee4a027e679ad2a5757098dfd/controllers/awsmachine_controller.go#L517C29-L522
  4. However, this only happens when the instance cannot be found by tags (or an AWS not-found/permission error occurs). So the bug comes from a change in v1beta2 that allows overwriting the Name tag (a sketch of the v1beta2 ordering follows this list): https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/api/v1beta2/tags.go#L244-L264
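Reconstructed from the linked tags.go as a sketch (not a verbatim copy), the v1beta2 Build sets the Name tag before merging Additional, so a user-supplied Name key wins:

// Sketch of the v1beta2 ordering: Name is written *before* Additional
// tags are merged, so a user-provided "Name" in additionalTags overwrites
// the machine's own name.
func Build(params BuildParams) Tags {
    tags := make(Tags)
    if params.Name != nil {
        tags["Name"] = *params.Name
    }
    for k, v := range params.Additional {
        tags[k] = v // a "Name" key here clobbers the machine name above
    }
    if params.ClusterName != "" {
        tags[ClusterTagKey(params.ClusterName)] = string(params.Lifecycle)
    }
    if params.Role != nil {
        tags[NameAWSClusterAPIRole] = *params.Role
    }
    return tags
}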

vs v1beta1, where the AWSMachine's own .metadata.name had precedence because Name was written after the Additional tags were merged:

// Build builds tags including the cluster tag and returns them in map form.
func Build(params BuildParams) Tags {
    tags := make(Tags)
    for k, v := range params.Additional {
        tags[k] = v
    }
    if params.ClusterName != "" {
        tags[ClusterTagKey(params.ClusterName)] = string(params.Lifecycle)
    }
    if params.Role != nil {
        tags[NameAWSClusterAPIRole] = *params.Role
    }
    if params.Name != nil {
        tags["Name"] = *params.Name
    }
    return tags
}

So I think we either need to disallow overwriting the Name tag, or we need to be smarter about how we find the instance based on tags. Even if we update the filtering logic to account for the additional Name tag, we don't want to find an instance that has already been claimed by another AWSMachine.

Environment:

  • Cluster-api-provider-aws version: v2.0.2
  • Kubernetes version: (use kubectl version):
    Client Version: v1.28.3
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.27.6+f67aeb3
  • OS (e.g. from /etc/os-release):
k8s-ci-robot added the kind/bug and needs-priority labels on Nov 10, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the needs-triage label on Nov 10, 2023
@mjlshen
Contributor Author

mjlshen commented Nov 13, 2023

As for why findInstance even needs to be called: normally providerID is populated once an EC2 instance has been created at least once. However, I noticed that this bug occurred in my case because CAPA did not have permission to perform ec2:CreateTags on ENIs:

// Once all the network interfaces attached to the specific instance are found, the same tags as the instance's are created for the network interfaces too
if len(networkInterfaces) > 0 {
    s.scope.Debug("Attempting to create tags from resource", "resource-id", out.ID)
    for _, networkInterface := range networkInterfaces {
        // Create/Update tags in AWS.
        if err := s.UpdateResourceTags(networkInterface.NetworkInterfaceId, out.Tags, nil); err != nil {
            return nil, errors.Wrapf(err, "failed to create tags for resource %q: ", *networkInterface.NetworkInterfaceId)
        }
    }
}
so the providerID was never returned and was lost, forcing findInstance to fall back to tag-based detection
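For context, the fallback path looks roughly like this: a paraphrased, simplified sketch of findInstance, not the verbatim controller code (the method names GetRunningInstanceByTags and InstanceIfExists come from CAPA's EC2 service interface; the real code also parses the instance ID out of the providerID):

// Paraphrased sketch: with no providerID recorded, the only way to find
// "our" instance is the tag filter quoted earlier in this issue.
func findInstance(scope *scope.MachineScope, ec2svc services.EC2Interface) (*infrav1.Instance, error) {
    pid := scope.GetProviderID()
    if pid == "" {
        // providerID was lost (here: the create path errored on ENI tagging
        // after RunInstances succeeded), so fall back to tag-based lookup.
        return ec2svc.GetRunningInstanceByTags(scope)
    }
    // Simplified: the real code extracts the bare instance ID first.
    return ec2svc.InstanceIfExists(aws.String(pid))
}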

@cnmcavoy
Contributor

cnmcavoy commented Nov 16, 2023

I am surprised the AWSMachine controller uses Name as a key to look up AWS resources. I would have expected it to fall back to a tag key CAPA explicitly controls, e.g. sigs.k8s.io/cluster-api-provider-aws/aws-machine. Perhaps CAPA should use a tag key it knows it can own and control rather than one that users expect to manage, like Name, since Name exists mostly for the human UI in the first place.
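A minimal sketch of what that could look like, using the aws-sdk-go v1 types already seen in the filter quoted above; the controller-owned tag key and its presence on instances are hypothetical here, not existing CAPA behavior:

// Hypothetical: look instances up by a controller-owned tag instead of Name,
// so user-managed Name values can never affect instance discovery.
input := &ec2.DescribeInstancesInput{
    Filters: []*ec2.Filter{
        filter.EC2.VPC(s.scope.VPC().ID),
        filter.EC2.ClusterOwned(s.scope.Name()),
        {
            // Illustrative tag key; CAPA does not currently set this tag.
            Name:   aws.String("tag:sigs.k8s.io/cluster-api-provider-aws/aws-machine"),
            Values: aws.StringSlice([]string{scope.Name()}),
        },
        filter.EC2.InstanceStates(ec2.InstanceStateNamePending, ec2.InstanceStateNameRunning),
    },
}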

@k8s-ci-robot
Contributor

@celebdor: Re-titling can only be requested by trusted users, like repository collaborators.

In response to this:

/retitle OCPBUGS-23174: AWSMachine controller leaks EC2 instances when specifying a Name tag in additionalTags

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@celebdor

sorry for the noise

@dlipovetsky
Contributor

From today's office hours: I wonder if we can address this once and for all by using the client token strategy described in https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_Instance_Idempotency.html
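For reference, RunInstances accepts a ClientToken parameter that makes the call idempotent. A minimal sketch of the idea, assuming aws-sdk-go v1 (github.com/aws/aws-sdk-go); the helper name and deriving the token from the AWSMachine UID are illustrative assumptions, not CAPA code:

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/ec2"
)

// runInstanceIdempotent shows the idea: presenting the same ClientToken on
// every retry makes EC2 return the original instance instead of launching a
// new one, even if the first response was lost.
func runInstanceIdempotent(client *ec2.EC2, input *ec2.RunInstancesInput, machineUID string) (*ec2.Reservation, error) {
    // Illustrative choice: a deterministic token per AWSMachine, e.g. its UID,
    // so every reconcile attempt presents the same token to EC2.
    input.ClientToken = aws.String(machineUID)
    return client.RunInstances(input)
}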

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Mar 10, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 9, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot closed this as not planned on May 9, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:


/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
