
Cannot create a VMSS dual-stack cluster #3163

Closed
lzhecheng opened this issue Feb 9, 2023 · 6 comments · Fixed by #3361
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

lzhecheng (Contributor) opened this issue Feb 9, 2023:

/kind bug

What steps did you take and what happened:
I tried to create a VMSS dual-stack cluster with this template. The result was a working dual-stack control-plane Node, but the VMSS Nodes did not get an IPv6 address.

zhechengli@devbox:~$ kdpo -n calico-system      calico-node-gl4jp
Name:                 calico-node-gl4jp
Namespace:            calico-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 zhecheng-206-1-mp-0000000/10.1.0.4
Start Time:           Mon, 06 Feb 2023 08:28:23 +0000
Labels:               app.kubernetes.io/name=calico-node
                      controller-revision-hash=85dc7f8d55
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          hash.operator.tigera.io/cni-config: 2d19559df12928367220b4d9c76b3883ce3e1755
                      hash.operator.tigera.io/tigera-ca-private: cfb4753a251d65748d10675458f7b80a5b261f09
Status:               Running
IP:                   10.1.0.4
IPs:
  IP:           10.1.0.4
...
    Environment:
      DATASTORE_TYPE:                      kubernetes
      WAIT_FOR_DATASTORE:                  true
      CLUSTER_TYPE:                        k8s,operator
      CALICO_DISABLE_FILE_LOGGING:         false
      FELIX_DEFAULTENDPOINTTOHOSTACTION:   ACCEPT
      FELIX_HEALTHENABLED:                 true
      FELIX_HEALTHPORT:                    9099
      NODENAME:                             (v1:spec.nodeName)
      NAMESPACE:                           calico-system (v1:metadata.namespace)
      FELIX_TYPHAK8SNAMESPACE:             calico-system
      FELIX_TYPHAK8SSERVICENAME:           calico-typha
      FELIX_TYPHACAFILE:                   /etc/pki/tls/certs/tigera-ca-bundle.crt
      FELIX_TYPHACERTFILE:                 /node-certs/tls.crt
      FELIX_TYPHAKEYFILE:                  /node-certs/tls.key
      FIPS_MODE_ENABLED:                   false
      FELIX_TYPHACN:                       typha-server
      CALICO_MANAGE_CNI:                   true
      CALICO_IPV4POOL_CIDR:                10.244.0.0/16
      CALICO_IPV4POOL_IPIP:                Never
      CALICO_IPV4POOL_BLOCK_SIZE:          26
      CALICO_IPV4POOL_NODE_SELECTOR:       all()
      CALICO_IPV4POOL_DISABLE_BGP_EXPORT:  false
      CALICO_IPV6POOL_CIDR:                2001:1234:5678:9a40::/58
root@zhecheng-206-1-mp-0000000:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc prio state UP group default qlen 1000
    link/ether 00:0d:3a:8a:40:7b brd ff:ff:ff:ff:ff:ff
    inet 10.1.0.4/16 metric 100 brd 10.1.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fe8a:407b/64 scope link
       valid_lft forever preferred_lft forever

From the portal, the network interface of the VMSS VM only has an IPv4 configuration:

        "ipConfigurations": [
            {
                "name": "private-ipConfig-0",
                "id": "/subscriptions/xxxx/resourceGroups/zhecheng-208/providers/Microsoft.Compute/virtualMachineScaleSets/zhecheng-208-mp-0/virtualMachines/0/networkInterfaces/zhecheng-208-mp-0-0/ipConfigurations/private-ipConfig-0",
                "etag": "W/\"740ec70b-1861-4498-9c4c-8d139220b455\"",
                "properties": {
                    "provisioningState": "Succeeded",
                    "privateIPAddress": "10.1.0.4",
                    "privateIPAllocationMethod": "Dynamic",
                    "subnet": {
                        "id": "/subscriptions/xxx/resourceGroups/zhecheng-208/providers/Microsoft.Network/virtualNetworks/zhecheng-208-vnet/subnets/node-subnet"
                    },
                    "primary": true,
                    "privateIPAddressVersion": "IPv4",
                    "loadBalancerBackendAddressPools": [
                        {
                            "id": "/subscriptions/xxx/resourceGroups/zhecheng-208/providers/Microsoft.Network/loadBalancers/zhecheng-208/backendAddressPools/zhecheng-208-outboundBackendPool"
                        }
                    ]
                }
            }
        ],

More discussion in PR #2154 (comment).
What did you expect to happen:
A VMSS dual-stack cluster.

Anything else you would like to add:

Environment:

  • cluster-api-provider-azure version: 1.7
  • Kubernetes version (from kubectl version): 1.25.3
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 9, 2023
mboersma (Contributor) commented Feb 9, 2023:

/milestone v1.8

@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Feb 9, 2023
CecileRobertMichon (Contributor) commented:

/assign

CecileRobertMichon (Contributor) commented:

@lzhecheng I found that VMSS NICs never had IPv6 addresses added, so this never worked. I will work on a PR to fix and test it this week.

CecileRobertMichon (Contributor) commented:

@lzhecheng I opened PR #3188 to fix this. However, I added VMSS to the e2e IPv6 test in order to validate it, and the test is failing at the ILB service deletion step: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/3188/pull-cluster-api-provider-azure-e2e/1626741548278353920

In the kube-controller-manager logs I see:

E0218 00:58:31.529281       1 controller.go:320] error processing service default/web4ki2z5-ilb (will retry): failed to delete load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
  "error": {
    "code": "LoadBalancerInUseByVirtualMachineScaleSet",
    "message": "Cannot delete load balancer /subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Network/loadBalancers/capz-e2e-uwamgq-ipv6-internal since its child resources capz-e2e-uwamgq-ipv6-IPv6 are in use by virtual machine scale set /subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-uwamgq-ipv6-mp-0.",
    "details": []
  }
}
I0218 00:58:31.529653       1 event.go:294] "Event occurred" object="default/web4ki2z5-ilb" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
	Error syncing load balancer: failed to delete load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
	  "error": {
	    "code": "LoadBalancerInUseByVirtualMachineScaleSet",
	    "message": "Cannot delete load balancer /subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Network/loadBalancers/capz-e2e-uwamgq-ipv6-internal since its child resources capz-e2e-uwamgq-ipv6-IPv6 are in use by virtual machine scale set /subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-uwamgq-ipv6-mp-0.",
	    "details": []
	  }
	}
 >

Any ideas? This is using the in-tree cloud provider (until #1889 merges).
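For context on the error: Azure rejects deletion of a load balancer while any VMSS model still references one of its child resources (here apparently the IPv6 backend address pool) in its network profile. A sketch of the kind of stale reference the message points at, inside the VMSS virtualMachineProfile ipConfigurations (structure per the Azure VMSS schema; assuming capz-e2e-uwamgq-ipv6-IPv6 is a backend pool):

```json
"ipConfigurations": [
    {
        "name": "ipConfigv6",
        "properties": {
            "privateIPAddressVersion": "IPv6",
            "loadBalancerBackendAddressPools": [
                {
                    "id": "/subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Network/loadBalancers/capz-e2e-uwamgq-ipv6-internal/backendAddressPools/capz-e2e-uwamgq-ipv6-IPv6"
                }
            ]
        }
    }
]
```

The load balancer cannot be deleted until the VMSS model drops this reference (or the VMSS itself is updated or deleted first).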

lzhecheng (Contributor, Author) commented:

@CecileRobertMichon Sorry, somehow I missed this message.
I cannot tell the cause from the current log. Since it is an in-tree cluster, the CCM may be missing some fixes. How about retrying after #3105?

lzhecheng (Contributor, Author) commented:

Oh, #3105 is blocked by kubernetes-sigs/cloud-provider-azure#3401 ... Is the problem still happening?
