
Cluster creation fails with an error, security group and subnet for an instance belong to different networks #3399

Closed · Tracked by #3530
pydctw opened this issue Apr 8, 2022 · 10 comments
Labels: kind/bug, lifecycle/rotten, priority/critical-urgent, triage/accepted
Milestone: V1.5.1

Comments

@pydctw (Contributor) commented Apr 8, 2022

/kind bug

What steps did you take and what happened:
Creating a cluster using a ClusterClass fails, and the log shows that instance creation failed with the error: failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks

E0408 17:14:02.788518       1 awsmachine_controller.go:497]  "msg"="unable to create instance" "error"="failed to create AWSMachine instance: failed to run instance: InvalidParameter: Security group sg-0b2785eae128cccad and subnet subnet-8b13d7d6 belong to different networks.\n\tstatus code: 400, request id: 3b28054f-c5e8-439c-bac3-0dda24431a27" 

AWSCluster

spec:
  network:
    subnets:
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.0.0/24
      id: subnet-0176a425f63781f71
      isPublic: false
      routeTableId: rtb-08275750c99fb2f3a
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2a
      cidrBlock: 10.0.1.0/24
      id: subnet-0464f24a3d364523f
      isPublic: true
      natGatewayId: nat-0d185215367997610
      routeTableId: rtb-051ced5fd65ae6600
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2a
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.2.0/24
      id: subnet-07023ae2d872062bf
      isPublic: false
      routeTableId: rtb-001434b0c17f5b0f4
      tags:
        Name: cluster-ew1b45-subnet-private-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-2b
      cidrBlock: 10.0.3.0/24
      id: subnet-00b92bd396eef0bf2
      isPublic: true
      natGatewayId: nat-0baa238a24de3b142
      routeTableId: rtb-0e3643d5f8b441ed9
      tags:
        Name: cluster-ew1b45-subnet-public-us-west-2b
        kubernetes.io/cluster/cluster-ew1b45: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-ew1b45: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public

AWSMachine

spec:
  ami: {}
  cloudInit:
    secureSecretsBackend: secrets-manager
  iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
  instanceID: i-0ed41df7645f74b06
  instanceType: t3.large
  providerID: aws:///us-west-2b/i-0ed41df7645f74b06
  sshKeyName: cluster-api-provider-aws-sigs-k8s-io

While sg-0b2785eae128cccad belongs to the CAPA-created VPC, subnet-8b13d7d6 belongs to the default VPC in the region. Note that subnet-8b13d7d6 is not referenced anywhere in the AWSCluster or AWSMachine spec.
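
For the record, a quick way to confirm the mismatch is to compare the VpcId of the security group and the subnet named in the error. This is a minimal sketch using aws-sdk-go (v1), not part of CAPA; the hard-coded IDs are just the ones from the error above:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	// VPC of the security group named in the error.
	sgOut, err := svc.DescribeSecurityGroups(&ec2.DescribeSecurityGroupsInput{
		GroupIds: []*string{aws.String("sg-0b2785eae128cccad")},
	})
	if err != nil {
		log.Fatal(err)
	}

	// VPC of the subnet EC2 actually chose.
	snOut, err := svc.DescribeSubnets(&ec2.DescribeSubnetsInput{
		SubnetIds: []*string{aws.String("subnet-8b13d7d6")},
	})
	if err != nil {
		log.Fatal(err)
	}

	sgVPC := aws.StringValue(sgOut.SecurityGroups[0].VpcId)
	snVPC := aws.StringValue(snOut.Subnets[0].VpcId)
	fmt.Printf("security group VPC: %s\nsubnet VPC: %s\nsame network: %v\n",
		sgVPC, snVPC, sgVPC == snVPC)
}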

What did you expect to happen:
Cluster creation is successful.

Anything else you would like to add:
The same issue was reported by a coworker using a different ClusterClass; they are on CAPA v1.2.0, while I am using the main branch.

Also, this issue doesn't happen every time. I've created clusters many times and hit the issue only a few times.

Environment:

  • Cluster-api-provider-aws version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot added the kind/bug, needs-priority, and needs-triage labels on Apr 8, 2022
@sedefsavas (Contributor) commented Apr 8, 2022
/triage accepted
/priority critical-urgent
/milestone v1.5.1

@k8s-ci-robot (Contributor) commented Apr 8, 2022
@sedefsavas: The provided milestone is not valid for this repository. Milestones in this repository: [Backlog, V1.5.1, v0.6.10, v0.7.4, v1.5.0, v1.x, v2.x]

Use /milestone clear to clear the milestone.

In response to this:

/triage accepted
/priority critical-urgent
/milestone v1.5.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the triage/accepted and priority/critical-urgent labels and removed the needs-triage and needs-priority labels on Apr 8, 2022
@sedefsavas added this to the V1.5.1 milestone on Apr 8, 2022
@sedefsavas (Contributor) commented Apr 8, 2022

AFAIK this issue has never been observed in e2e tests without ClusterClass, so it may be triggered by, or related to, the inner workings of ClusterClass.

If so, I will reduce the priority accordingly.

@pydctw (Contributor, author) commented Apr 11, 2022

This is the first time I've seen the error, so I agree that it is likely ClusterClass-related.

@pydctw (Contributor, author) commented Apr 13, 2022

This was such a fascinating and difficult issue to debug.

Observations

  • The issue happens randomly: instance creation can fail for the first, second, or third control plane machine.
  • Cluster creation succeeds most of the time and hits this issue only occasionally.
  • A cluster that failed an e2e test due to a timeout while waiting for a control plane eventually created an instance, and the cluster became ready.

Debugging

For a failed instance creation, below is the input sent to the AWS API:

{
    "instancesSet": {
      "items": [
        {
          "imageId": "ami-093e132cf8ec45d77",
          "minCount": 1,
          "maxCount": 1,
          "keyName": "cluster-api-provider-aws-sigs-k8s-io"
        }
      ]
    },
    "groupSet": {
      "items": [
        {
          "groupId": "sg-07c3eb751181ac0ab"
        },
        {
          "groupId": "sg-05683bb88ffba846b"
        },
        {
          "groupId": "sg-08f3c5c87413f9212"
        }
      ]
    },
    "userData": "<sensitiveDataRemoved>",
    "instanceType": "t3.large",
    "blockDeviceMapping": {},
    "monitoring": {
      "enabled": false
    },
    "disableApiTermination": false,
    "disableApiStop": false,
    "clientToken": "96DAC283-22A0-4195-A496-78DAA918244B",
    "iamInstanceProfile": {
      "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
    },
    "tagSpecificationSet": {
      "items": [
        {
          "resourceType": "instance",
          "tags": [
            {
              "key": "MachineName",
              "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
            },
            {
              "key": "Name",
              "value": "cluster-qmul89-control-plane-n9994-2lrrt"
            },
            {
              "key": "kubernetes.io/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/role",
              "value": "control-plane"
            }
          ]
        }
      ]
    }
  }

Compare it with the input for a successful case:

{
    "instancesSet": {
      "items": [
        {
          "imageId": "ami-093e132cf8ec45d77",
          "minCount": 1,
          "maxCount": 1,
          "keyName": "cluster-api-provider-aws-sigs-k8s-io"
        }
      ]
    },
    "groupSet": {
      "items": [
        {
          "groupId": "sg-07c3eb751181ac0ab"
        },
        {
          "groupId": "sg-05683bb88ffba846b"
        },
        {
          "groupId": "sg-08f3c5c87413f9212"
        }
      ]
    },
    "userData": "<sensitiveDataRemoved>",
    "instanceType": "t3.large",
    "blockDeviceMapping": {},
    "monitoring": {
      "enabled": false
    },
    "subnetId": "subnet-04069978047301fce",
    "disableApiTermination": false,
    "disableApiStop": false,
    "clientToken": "0DD45959-4F4F-442C-9C8A-24D6B49239DA",
    "iamInstanceProfile": {
      "name": "control-plane.cluster-api-provider-aws.sigs.k8s.io"
    },
    "tagSpecificationSet": {
      "items": [
        {
          "resourceType": "instance",
          "tags": [
            {
              "key": "MachineName",
              "value": "functional-test-multi-az-clusterclass-n3nuim/cluster-qmul89-v45rg-8xm24"
            },
            {
              "key": "Name",
              "value": "cluster-qmul89-control-plane-n9994-2lrrt"
            },
            {
              "key": "kubernetes.io/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89",
              "value": "owned"
            },
            {
              "key": "sigs.k8s.io/cluster-api-provider-aws/role",
              "value": "control-plane"
            }
          ]
        }
      ]
    }
  }

The difference is that the failed request does not include a subnetId, which causes AWS to pick a default subnet for the instance, in this case a subnet in the region's default VPC.
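
To make the failure mode concrete: in aws-sdk-go (v1), RunInstancesInput.SubnetId is an optional pointer, and when it is omitted EC2 silently falls back to a default subnet. The sketch below is illustrative, not CAPA's actual code; the guard shows the behavior we would want instead:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// runInstance launches an instance into an explicit subnet. If SubnetId
// were left out of the input, EC2 would pick a default subnet in the
// default VPC, which is exactly what produced the failed request above.
func runInstance(svc *ec2.EC2, subnetID string) error {
	if subnetID == "" {
		return fmt.Errorf("refusing to call RunInstances without a subnet ID")
	}
	_, err := svc.RunInstances(&ec2.RunInstancesInput{
		ImageId:      aws.String("ami-093e132cf8ec45d77"),
		InstanceType: aws.String("t3.large"),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		SubnetId:     aws.String(subnetID),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	if err := runInstance(ec2.New(sess), "subnet-04069978047301fce"); err != nil {
		log.Fatal(err)
	}
}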

Root Cause Analysis

This happens because of an already known issue: capi-controller-manager continuously patches AWSCluster object when using ClusterClass (#6320)

With ClusterClass, the AWSCluster subnet spec oscillates between two states:

  • After CAPA's patch:
  network:
    ...
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/24
      id: subnet-04069978047301fce
      isPublic: false
      routeTableId: rtb-06e5b16760a136a9b
      tags:
        Name: cluster-qmul89-subnet-private-us-west-1a
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.1.0/24
      id: subnet-057e208911a7100a9
      isPublic: true
      natGatewayId: nat-02b99bb47ed11bab0
      routeTableId: rtb-0c1181c7a47238747
      tags:
        Name: cluster-qmul89-subnet-public-us-west-1a
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.2.0/24
      id: subnet-0d987044191d6131a
      isPublic: false
      routeTableId: rtb-0c19e5639177973ae
      tags:
        Name: cluster-qmul89-subnet-private-us-west-1c
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/internal-elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: private
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.3.0/24
      id: subnet-006c42a116e38379a
      isPublic: true
      natGatewayId: nat-018176214822b0de8
      routeTableId: rtb-03e0196d18896750b
      tags:
        Name: cluster-qmul89-subnet-public-us-west-1c
        kubernetes.io/cluster/cluster-qmul89: shared
        kubernetes.io/role/elb: "1"
        sigs.k8s.io/cluster-api-provider-aws/cluster/cluster-qmul89: owned
        sigs.k8s.io/cluster-api-provider-aws/role: public
  • After CAPI's patch:
  network:
    subnets:
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.0.0/24
    - availabilityZone: us-west-1a
      cidrBlock: 10.0.1.0/24
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.2.0/24
    - availabilityZone: us-west-1c
      cidrBlock: 10.0.3.0/24

Instance creation fails when the AWSCluster spec's subnets are in the second state, where subnets exist but have no IDs. The subnet ID is therefore empty here: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/main/pkg/cloud/services/ec2/instances.go#L340

Fixes

The long-term solution is the fix for kubernetes-sigs/cluster-api#6320; in the meantime, we can improve CAPA's subnet-finding logic, which assumes subnets always have non-empty IDs (which has historically been the case). A sketch of such a guard follows.
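
A sketch of what that hardening could look like; the type and function names here are hypothetical, not CAPA's actual implementation:

package subnets

import "fmt"

// SubnetSpec is a hypothetical stand-in for CAPA's subnet spec type.
type SubnetSpec struct {
	ID               string
	AvailabilityZone string
	IsPublic         bool
}

// findPrivateSubnetID returns the ID of a private subnet in the given AZ.
// Instead of silently returning "" while the spec is in the transient,
// ID-less state left by the CAPI patch (kubernetes-sigs/cluster-api#6320),
// it returns an error so the reconciler can requeue rather than let EC2
// fall back to the default VPC.
func findPrivateSubnetID(subnets []SubnetSpec, az string) (string, error) {
	for _, sn := range subnets {
		if sn.AvailabilityZone != az || sn.IsPublic {
			continue
		}
		if sn.ID == "" {
			return "", fmt.Errorf("subnet in %s has no ID yet, requeueing", az)
		}
		return sn.ID, nil
	}
	return "", fmt.Errorf("no private subnet found in %s", az)
}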

@k8s-triage-robot commented Jul 12, 2022
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 12, 2022
@pydctw (Contributor, author) commented Jul 12, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jul 12, 2022
@k8s-triage-robot commented Oct 10, 2022
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 10, 2022
@k8s-triage-robot commented Nov 10, 2022
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Nov 10, 2022
@pydctw (Contributor, author) commented Nov 10, 2022

This should have been fixed with SSA support in CAPA.
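
For context on why server-side apply (SSA) helps here: with SSA, each manager owns only the fields it sets, so CAPI's periodic apply no longer wipes out the subnet IDs that CAPA wrote. A minimal controller-runtime sketch, with an illustrative field owner name rather than CAPA's actual one:

package reconcile

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applySubnets server-side-applies only the fields this controller sets.
// The API server records "capa-controller" (illustrative name) as the
// owner of those fields, so another manager's apply cannot clear them.
func applySubnets(ctx context.Context, c client.Client, obj client.Object) error {
	return c.Patch(ctx, obj, client.Apply,
		client.FieldOwner("capa-controller"),
		client.ForceOwnership,
	)
}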

@pydctw closed this as completed on Nov 10, 2022