
Cannot create a VMSS dual-stack cluster #3163

Closed
lzhecheng opened this issue Feb 9, 2023 · 6 comments · Fixed by #3361
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

lzhecheng (Contributor) opened this issue Feb 9, 2023:

/kind bug

What steps did you take and what happened:
I tried to create a VMSS dual-stack cluster with this template. The result was a working dual-stack control-plane Node, but the VMSS Nodes did not get an IPv6 address.

zhechengli@devbox:~$ kdpo -n calico-system      calico-node-gl4jp
Name:                 calico-node-gl4jp
Namespace:            calico-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 zhecheng-206-1-mp-0000000/10.1.0.4
Start Time:           Mon, 06 Feb 2023 08:28:23 +0000
Labels:               app.kubernetes.io/name=calico-node
                      controller-revision-hash=85dc7f8d55
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          hash.operator.tigera.io/cni-config: 2d19559df12928367220b4d9c76b3883ce3e1755
                      hash.operator.tigera.io/tigera-ca-private: cfb4753a251d65748d10675458f7b80a5b261f09
Status:               Running
IP:                   10.1.0.4
IPs:
  IP:           10.1.0.4
...
    Environment:
      DATASTORE_TYPE:                      kubernetes
      WAIT_FOR_DATASTORE:                  true
      CLUSTER_TYPE:                        k8s,operator
      CALICO_DISABLE_FILE_LOGGING:         false
      FELIX_DEFAULTENDPOINTTOHOSTACTION:   ACCEPT
      FELIX_HEALTHENABLED:                 true
      FELIX_HEALTHPORT:                    9099
      NODENAME:                             (v1:spec.nodeName)
      NAMESPACE:                           calico-system (v1:metadata.namespace)
      FELIX_TYPHAK8SNAMESPACE:             calico-system
      FELIX_TYPHAK8SSERVICENAME:           calico-typha
      FELIX_TYPHACAFILE:                   /etc/pki/tls/certs/tigera-ca-bundle.crt
      FELIX_TYPHACERTFILE:                 /node-certs/tls.crt
      FELIX_TYPHAKEYFILE:                  /node-certs/tls.key
      FIPS_MODE_ENABLED:                   false
      FELIX_TYPHACN:                       typha-server
      CALICO_MANAGE_CNI:                   true
      CALICO_IPV4POOL_CIDR:                10.244.0.0/16
      CALICO_IPV4POOL_IPIP:                Never
      CALICO_IPV4POOL_BLOCK_SIZE:          26
      CALICO_IPV4POOL_NODE_SELECTOR:       all()
      CALICO_IPV4POOL_DISABLE_BGP_EXPORT:  false
      CALICO_IPV6POOL_CIDR:                2001:1234:5678:9a40::/58
root@zhecheng-206-1-mp-0000000:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc prio state UP group default qlen 1000
    link/ether 00:0d:3a:8a:40:7b brd ff:ff:ff:ff:ff:ff
    inet 10.1.0.4/16 metric 100 brd 10.1.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fe8a:407b/64 scope link
       valid_lft forever preferred_lft forever

From the portal, the network interface of the VMSS VM only has an IPv4 configuration:

        "ipConfigurations": [
            {
                "name": "private-ipConfig-0",
                "id": "/subscriptions/xxxx/resourceGroups/zhecheng-208/providers/Microsoft.Compute/virtualMachineScaleSets/zhecheng-208-mp-0/virtualMachines/0/networkInterfaces/zhecheng-208-mp-0-0/ipConfigurations/private-ipConfig-0",
                "etag": "W/\"740ec70b-1861-4498-9c4c-8d139220b455\"",
                "properties": {
                    "provisioningState": "Succeeded",
                    "privateIPAddress": "10.1.0.4",
                    "privateIPAllocationMethod": "Dynamic",
                    "subnet": {
                        "id": "/subscriptions/xxx/resourceGroups/zhecheng-208/providers/Microsoft.Network/virtualNetworks/zhecheng-208-vnet/subnets/node-subnet"
                    },
                    "primary": true,
                    "privateIPAddressVersion": "IPv4",
                    "loadBalancerBackendAddressPools": [
                        {
                            "id": "/subscriptions/xxx/resourceGroups/zhecheng-208/providers/Microsoft.Network/loadBalancers/zhecheng-208/backendAddressPools/zhecheng-208-outboundBackendPool"
                        }
                    ]
                }
            }
        ],

More discussion in PR #2154 (comment).
What did you expect to happen:
A VMSS dual-stack cluster.

Anything else you would like to add:

Environment:

  • cluster-api-provider-azure version: 1.7
  • Kubernetes version (from kubectl version): 1.25.3
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 9, 2023
mboersma (Contributor) commented Feb 9, 2023:

/milestone v1.8

@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Feb 9, 2023
CecileRobertMichon (Contributor) commented:

/assign

CecileRobertMichon (Contributor) commented:

@lzhecheng I found that VMSS NICs never had IPv6 addresses added, so this never worked. I will work on a PR to fix and test it this week.

CecileRobertMichon (Contributor) commented:

@lzhecheng I opened PR #3188 to fix this. However, I added VMSS to the e2e IPv6 test in order to validate it, and the test is failing at the ILB service deletion step: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/3188/pull-cluster-api-provider-azure-e2e/1626741548278353920

In the kube-controller-manager logs I see:

E0218 00:58:31.529281       1 controller.go:320] error processing service default/web4ki2z5-ilb (will retry): failed to delete load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
  "error": {
    "code": "LoadBalancerInUseByVirtualMachineScaleSet",
    "message": "Cannot delete load balancer /subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Network/loadBalancers/capz-e2e-uwamgq-ipv6-internal since its child resources capz-e2e-uwamgq-ipv6-IPv6 are in use by virtual machine scale set /subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-uwamgq-ipv6-mp-0.",
    "details": []
  }
}
I0218 00:58:31.529653       1 event.go:294] "Event occurred" object="default/web4ki2z5-ilb" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
	Error syncing load balancer: failed to delete load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: {
	  "error": {
	    "code": "LoadBalancerInUseByVirtualMachineScaleSet",
	    "message": "Cannot delete load balancer /subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Network/loadBalancers/capz-e2e-uwamgq-ipv6-internal since its child resources capz-e2e-uwamgq-ipv6-IPv6 are in use by virtual machine scale set /subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Compute/virtualMachineScaleSets/capz-e2e-uwamgq-ipv6-mp-0.",
	    "details": []
	  }
	}
 >

Any ideas? This is using the in-tree cloud provider (until #1889 merges).
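For context on the error: Azure rejects deletion of a load balancer while any VMSS model still references one of its child resources (here apparently the IPv6 backend address pool) in its network profile. A sketch of the kind of stale reference the message points at, inside the VMSS virtualMachineProfile ipConfigurations (structure per the Azure VMSS schema; assuming capz-e2e-uwamgq-ipv6-IPv6 is a backend pool):

```json
"ipConfigurations": [
    {
        "name": "ipConfigv6",
        "properties": {
            "privateIPAddressVersion": "IPv6",
            "loadBalancerBackendAddressPools": [
                {
                    "id": "/subscriptions/===REDACTED===/resourceGroups/capz-e2e-uwamgq-ipv6/providers/Microsoft.Network/loadBalancers/capz-e2e-uwamgq-ipv6-internal/backendAddressPools/capz-e2e-uwamgq-ipv6-IPv6"
                }
            ]
        }
    }
]
```

The load balancer cannot be deleted until the VMSS model drops this reference (or the VMSS itself is updated or deleted first).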

lzhecheng (Contributor, Author) commented:

@CecileRobertMichon Sorry, somehow I missed this message.
I cannot tell the cause from the current log. Since it is an in-tree cluster, the CCM may be missing some fixes. How about retrying after #3105?

lzhecheng (Contributor, Author) commented:

Oh, #3105 is blocked by kubernetes-sigs/cloud-provider-azure#3401 ... Is the problem still happening?
