
When running a workload with a single control plane node the load balancers take 15 mins to provision #857

Closed
jsturtevant opened this issue Aug 3, 2020 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jsturtevant
Contributor

jsturtevant commented Aug 3, 2020

/kind bug

status (as of 6/10/21):

What steps did you take and what happened:
[A clear and concise description of what the bug is.]
When running a workload with a single control plane node the load balancers take 15 mins to provision.

Add the following to the Creating a single control-plane cluster with 1 worker node e2e test after cluster creation:

AzureLBSpec(ctx, func() AzureLBSpecInput {
	return AzureLBSpecInput{
		BootstrapClusterProxy: bootstrapClusterProxy,
		Namespace:             namespace,
		ClusterName:           clusterName,
		SkipCleanup:           skipCleanup,
	}
})

Run the e2e test and the test will fail:

./scripts/ci-e2e.sh

## commented out

Workload cluster creation
[1] /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:36
[1]   Creating a single control-plane cluster
[1]   /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:71
[1]     With 1 worker node [It]
[1]     /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:72
[1]
[1]     Timed out after 180.000s.
[1]     Service default/ingress-nginx-ilb failed to get an IP for LoadBalancer.Ingress
[1]     Expected
[1]         <bool>: false
[1]     to be true
[1]
[1]     /home/jstur/projects/cluster-api-provider-azure/test/e2e/helpers.go:97

If you connect to the workload cluster, you will see that the LoadBalancer service exists but takes about 15 minutes to be provisioned. Subsequent LoadBalancer services provision quickly. The controller manager logs contain:

E0802 23:30:42.090814       1 azure_vmss.go:1116] EnsureHostInPool(default/ingress-nginx-ilb): backendPoolID(/subscriptions/b9d9436a-0c07-4fe8-b779-2c1030bd7997/resourceGroups/capz-e2e-72fll1/providers/Microsoft.Network/loadBalancers/capz-e2e-72fll1-internal/backendAddressPools/capz-e2e-72fll1) - failed to ensure host in pool: "not a vmss instance"
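
For reference, a rough sketch of how one might observe this from the workload cluster (the service name comes from the test output above; the kubeconfig path is a placeholder):

# point kubectl at the workload cluster (kubeconfig path is a placeholder)
export KUBECONFIG=./capz-e2e-workload.kubeconfig

# the service exists, but EXTERNAL-IP stays <pending> for roughly 15 minutes
kubectl get service ingress-nginx-ilb --namespace default --watch

# the "not a vmss instance" errors appear in the controller manager logs
kubectl logs --namespace kube-system --selector component=kube-controller-manager | grep "not a vmss instance"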

What did you expect to happen:
The test should be able to provision a workload cluster and pass an e2e test that creates a load balancer.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

This is related to kubernetes-sigs/cloud-provider-azure#338

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 3, 2020
@alexeldeib
Contributor

Guess this is another code path related to not using Availability Sets? We should probably consider that as a mitigation. Anything that tries to look up IDs will fail with our current setup; it's hard to track down all the places individually.

@jsturtevant
Contributor Author

jsturtevant commented Aug 3, 2020

It is related, but it is not the root cause. The root cause is the cache used in the controller-manager. I provided more details in kubernetes-sigs/cloud-provider-azure#363.

I believe this could also cause delays in a customer scenario where a node is added after the cluster is provisioned: LB provisioning is delayed because the cache doesn't know about the new node.
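
A hedged sketch of that customer scenario (resource and image names below are placeholders, not taken from this issue): scale out workers on an already-provisioned cluster, then create a new LoadBalancer service and watch how long the IP takes to appear.

# against the management cluster: add a worker node to an already-provisioned cluster
kubectl scale machinedeployment capi-quickstart-md-0 --replicas=3

# against the workload cluster: expose a test deployment through a new load balancer
kubectl create deployment echo --image=k8s.gcr.io/echoserver:1.4
kubectl expose deployment echo --type=LoadBalancer --port=8080
kubectl get service echo --watch   # EXTERNAL-IP may stay <pending> until the cache catches up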

@CecileRobertMichon CecileRobertMichon modified the milestones: next, v0.4.8 Aug 4, 2020
@nader-ziada nader-ziada modified the milestones: v0.4.8, v0.4.9 Sep 3, 2020
@CecileRobertMichon CecileRobertMichon modified the milestones: v0.4.9, v0.4.10 Oct 1, 2020
@CecileRobertMichon CecileRobertMichon modified the milestones: v0.4.10, next Nov 12, 2020
@jsturtevant
Contributor Author

I ran into this again trying to set up a single control plane test for Windows. This appears to be an issue only in the VMAS scenario. There is a VMSS test that uses a single control plane node:

Context("Creating a VMSS cluster", func() {
	It("with a single control plane node and an AzureMachinePool with 2 nodes", func() {

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2021
@CecileRobertMichon
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2021
@CecileRobertMichon CecileRobertMichon modified the milestones: next, v0.5.x Mar 18, 2021
@lastcoolnameleft
Contributor

FYI, I hit this issue today.

The only changes I made from the default capi-quickstart.yaml were allocate-node-cidrs: "true" and version: v1.19.9; I then installed an ingress controller via our docs.

Let me know if you'd like for me to provide the full yaml.

The public IP is available now and the ingress works; however, as you can see from the logs, it took ~10 minutes.

I0511 17:15:50.069909       1 range_allocator.go:373] Set node capi-quickstart-md-0-9hhjr PodCIDR to [192.168.2.0/24]
E0511 17:15:53.063745       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-grzqf exists: not a vmss instance
E0511 17:15:53.063832       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-9hhjr exists: not a vmss instance
W0511 17:15:53.645025       1 node_lifecycle_controller.go:1044] Missing timestamp for Node capi-quickstart-md-0-9hhjr. Assuming now as a timestamp.
I0511 17:15:53.645236       1 event.go:291] "Event occurred" object="capi-quickstart-md-0-9hhjr" kind="Node" apiVersion="v1" type="Normal" reason="RegisteredNode" message="Node capi-quickstart-md-0-9hhjr event: Registered Node capi-quickstart-md-0-9hhjr in Controller"
E0511 17:15:58.064296       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-9hhjr exists: not a vmss instance
E0511 17:15:58.064596       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-grzqf exists: not a vmss instance
I0511 17:16:02.595945       1 route_controller.go:213] Created route for node capi-quickstart-md-0-grzqf 192.168.1.0/24 with hint 5f107155-a08a-44ac-8cb7-ead0da2e3a50 after 18.211955581s
I0511 17:16:02.595993       1 route_controller.go:303] Patching node status capi-quickstart-md-0-grzqf with true previous condition was:nil
....

I0511 17:23:59.859609       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"
I0511 17:26:39.859458       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0511 17:26:40.039313       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039342       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039367       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039383       1 azure_loadbalancer.go:162] reconcileLoadBalancer(ingress-basic/nginx-ingress-ingress-nginx-controller) failed: not a vmss instance
E0511 17:26:40.039435       1 controller.go:275] error processing service ingress-basic/nginx-ingress-ingress-nginx-controller (will retry): failed to ensure load balancer: not a vmss instance
I0511 17:26:40.039740       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"
I0511 17:31:40.040301       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0511 17:31:52.158543       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"

@lastcoolnameleft
Contributor

Oh, I also tried installing the Flannel CNI, but I don't think that should have impacted it.

@devigned
Contributor

devigned commented May 11, 2021

@lastcoolnameleft I think if you use the external cloud provider, there are fixes available for this issue (see above in the thread where #1216 is linked).

Rather than using the default template, you'd use --flavor external-cloud-provider.

As an aside, perhaps we should use the out-of-tree provider by default...
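
For clarity, a sketch of what that flavor selection looks like with clusterctl (cluster name and Kubernetes version are placeholders; other required flags and environment variables are omitted):

# generate the workload cluster manifest from the external-cloud-provider flavor
clusterctl config cluster capi-quickstart --kubernetes-version v1.19.9 --flavor external-cloud-provider > capi-quickstart.yaml
kubectl apply -f capi-quickstart.yaml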

@CecileRobertMichon
Contributor

@devigned @lastcoolnameleft unfortunately, the current version of external-cloud-provider we're using in the example template in CAPZ v0.4 doesn't have the fix yet. The PR to bump the version (#1323) and enable the test that validates this behavior was blocked by another regression in cloud-provider, which is now released. You can work around it for now by editing your template to use version v0.7.4+ of cloud-provider until we update the reference template.
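
A rough sketch of that interim workaround (the image names below are assumptions about the external-cloud-provider flavor template, so verify them against your generated manifest before applying):

# bump the out-of-tree cloud provider images to v0.7.4 in the generated template
sed -i 's|azure-cloud-controller-manager:v[0-9.]*|azure-cloud-controller-manager:v0.7.4|' capi-quickstart.yaml
sed -i 's|azure-cloud-node-manager:v[0-9.]*|azure-cloud-node-manager:v0.7.4|' capi-quickstart.yaml
kubectl apply -f capi-quickstart.yaml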

The in-tree fix will be in k8s 1.22+.

Regarding using out-of-tree by default, v1.0.0 of the out-of-tree provider just got released, so it might be a good time to do that; tracking in #715.

@nader-ziada nader-ziada added this to the v0.5 milestone Aug 26, 2021
@CecileRobertMichon CecileRobertMichon modified the milestones: v0.5, next Oct 28, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shysank
Contributor

shysank commented Mar 28, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 28, 2022
@shysank shysank reopened this Mar 28, 2022
@CecileRobertMichon
Contributor

I think we can close this now. This was fixed in the external cloud provider v0.7.4+ and k8s 1.22+.

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closing this issue.

In response to this:

I think we can close this now. This was fixed in the external cloud provider v0.7.4+ and k8s 1.22+.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon CecileRobertMichon removed this from the next milestone May 4, 2023