
When running a workload with a single control plane node the load balancers take 15 mins to provision #857

Closed
jsturtevant opened this issue Aug 3, 2020 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jsturtevant
Contributor

jsturtevant commented Aug 3, 2020

/kind bug

status (as of 6/10/21):

What steps did you take and what happened:
[A clear and concise description of what the bug is.]
When running a workload with a single control plane node the load balancers take 15 mins to provision.

Add the following to the Creating a single control-plane cluster with 1 worker node e2e test after cluster creation:

AzureLBSpec(ctx, func() AzureLBSpecInput {
	return AzureLBSpecInput{
		BootstrapClusterProxy: bootstrapClusterProxy,
		Namespace:             namespace,
		ClusterName:           clusterName,
		SkipCleanup:           skipCleanup,
	}
})

Run the e2e test and the test will fail:

./scripts/ci-e2e.sh

## commented out

Workload cluster creation
[1] /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:36
[1]   Creating a single control-plane cluster
[1]   /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:71
[1]     With 1 worker node [It]
[1]     /home/jstur/projects/cluster-api-provider-azure/test/e2e/azure_test.go:72
[1]
[1]     Timed out after 180.000s.
[1]     Service default/ingress-nginx-ilb failed to get an IP for LoadBalancer.Ingress
[1]     Expected
[1]         <bool>: false
[1]     to be true
[1]
[1]     /home/jstur/projects/cluster-api-provider-azure/test/e2e/helpers.go:97

If you connect to the workload cluster, you will see that the LoadBalancer service exists but takes about 15 minutes to be provisioned. Subsequent LoadBalancer services provision quickly. The controller manager logs contain:

E0802 23:30:42.090814       1 azure_vmss.go:1116] EnsureHostInPool(default/ingress-nginx-ilb): backendPoolID(/subscriptions/b9d9436a-0c07-4fe8-b779-2c1030bd7997/resourceGroups/capz-e2e-72fll1/providers/Microsoft.Network/loadBalancers/capz-e2e-72fll1-internal/backendAddressPools/capz-e2e-72fll1) - failed to ensure host in pool: "not a vmss instance"
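
For reference, a rough sketch of how one might observe this from the workload cluster (the service name comes from the test output above; the kubeconfig path is a placeholder):

# point kubectl at the workload cluster (kubeconfig path is a placeholder)
export KUBECONFIG=./capz-e2e-workload.kubeconfig

# the service exists, but EXTERNAL-IP stays <pending> for roughly 15 minutes
kubectl get service ingress-nginx-ilb --namespace default --watch

# the "not a vmss instance" errors appear in the controller manager logs
kubectl logs --namespace kube-system --selector component=kube-controller-manager | grep "not a vmss instance"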

What did you expect to happen:
The test should be able to provision a workload cluster and pass an e2e test that creates a load balancer.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

This is related to kubernetes-sigs/cloud-provider-azure#338

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 3, 2020
@alexeldeib
Contributor

Guess this is another code path related to not using Availability Sets? We should probably consider that as a mitigation. Anything that tries to look up IDs will fail with our current setup; it's hard to track down all the places individually.

@jsturtevant
Contributor Author

jsturtevant commented Aug 3, 2020

It is related, but it is not the root cause. The root cause is the cache used in the controller-manager. I provided more details in kubernetes-sigs/cloud-provider-azure#363.

I believe this could also cause delays in a customer scenario where a node is added after the cluster is provisioned: LB provisioning is delayed because the cache doesn't know about the new node.
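
A hedged sketch of that customer scenario (resource and image names below are placeholders, not taken from this issue): scale out workers on an already-provisioned cluster, then create a new LoadBalancer service and watch how long the IP takes to appear.

# against the management cluster: add a worker node to an already-provisioned cluster
kubectl scale machinedeployment capi-quickstart-md-0 --replicas=3

# against the workload cluster: expose a test deployment through a new load balancer
kubectl create deployment echo --image=k8s.gcr.io/echoserver:1.4
kubectl expose deployment echo --type=LoadBalancer --port=8080
kubectl get service echo --watch   # EXTERNAL-IP may stay <pending> until the cache catches up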

@CecileRobertMichon CecileRobertMichon modified the milestones: next, v0.4.8 Aug 4, 2020
@nader-ziada nader-ziada modified the milestones: v0.4.8, v0.4.9 Sep 3, 2020
@CecileRobertMichon CecileRobertMichon modified the milestones: v0.4.9, v0.4.10 Oct 1, 2020
@CecileRobertMichon CecileRobertMichon modified the milestones: v0.4.10, next Nov 12, 2020
@jsturtevant
Contributor Author

I ran into this again trying to set up a single control plane test for Windows. This appears to be an issue only in the VMAS scenario. There is a VMSS test that uses a single control plane node:

Context("Creating a VMSS cluster", func() {
	It("with a single control plane node and an AzureMachinePool with 2 nodes", func() {

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2021
@CecileRobertMichon
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2021
@CecileRobertMichon CecileRobertMichon modified the milestones: next, v0.5.x Mar 18, 2021
@lastcoolnameleft
Contributor

FYI, I hit this issue today.

The only changes I made from the default capi-quickstart.yaml were allocate-node-cidrs: "true" and version: v1.19.9; I then installed an ingress controller via our docs.

Let me know if you'd like for me to provide the full yaml.

The public IP is available now and the ingress works; however, as you can see from the logs, it took ~10 minutes.

I0511 17:15:50.069909       1 range_allocator.go:373] Set node capi-quickstart-md-0-9hhjr PodCIDR to [192.168.2.0/24]
E0511 17:15:53.063745       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-grzqf exists: not a vmss instance
E0511 17:15:53.063832       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-9hhjr exists: not a vmss instance
W0511 17:15:53.645025       1 node_lifecycle_controller.go:1044] Missing timestamp for Node capi-quickstart-md-0-9hhjr. Assuming now as a timestamp.
I0511 17:15:53.645236       1 event.go:291] "Event occurred" object="capi-quickstart-md-0-9hhjr" kind="Node" apiVersion="v1" type="Normal" reason="RegisteredNode" message="Node capi-quickstart-md-0-9hhjr event: Registered Node capi-quickstart-md-0-9hhjr in Controller"
E0511 17:15:58.064296       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-9hhjr exists: not a vmss instance
E0511 17:15:58.064596       1 node_lifecycle_controller.go:149] error checking if node capi-quickstart-md-0-grzqf exists: not a vmss instance
I0511 17:16:02.595945       1 route_controller.go:213] Created route for node capi-quickstart-md-0-grzqf 192.168.1.0/24 with hint 5f107155-a08a-44ac-8cb7-ead0da2e3a50 after 18.211955581s
I0511 17:16:02.595993       1 route_controller.go:303] Patching node status capi-quickstart-md-0-grzqf with true previous condition was:nil
....

I0511 17:23:59.859609       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"
I0511 17:26:39.859458       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0511 17:26:40.039313       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039342       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039367       1 azure_vmss.go:1229] EnsureHostInPool(ingress-basic/nginx-ingress-ingress-nginx-controller): backendPoolID(/subscriptions/df8428d4-bc25-4601-b458-1c8533ceec0b/resourceGroups/capi-quickstart/providers/Microsoft.Network/loadBalancers/capi-quickstart/backendAddressPools/capi-quickstart) - failed to ensure host in pool: "not a vmss instance"
E0511 17:26:40.039383       1 azure_loadbalancer.go:162] reconcileLoadBalancer(ingress-basic/nginx-ingress-ingress-nginx-controller) failed: not a vmss instance
E0511 17:26:40.039435       1 controller.go:275] error processing service ingress-basic/nginx-ingress-ingress-nginx-controller (will retry): failed to ensure load balancer: not a vmss instance
I0511 17:26:40.039740       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"
I0511 17:31:40.040301       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I0511 17:31:52.158543       1 event.go:291] "Event occurred" object="ingress-basic/nginx-ingress-ingress-nginx-controller" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"

@lastcoolnameleft
Contributor

Oh, I also tried installing the Flannel CNI, but I don't think that should have impacted it.

@devigned
Contributor

devigned commented May 11, 2021

@lastcoolnameleft I think if you use the external cloud provider, there are fixes available for this issue (see above in the thread where #1216 is linked).

Rather than using the default template, you'd use --flavor external-cloud-provider.

As an aside, perhaps we should use the out-of-tree provider by default...
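
For clarity, a sketch of what that flavor selection looks like with clusterctl (cluster name and Kubernetes version are placeholders; other required flags and environment variables are omitted):

# generate the workload cluster manifest from the external-cloud-provider flavor
clusterctl config cluster capi-quickstart --kubernetes-version v1.19.9 --flavor external-cloud-provider > capi-quickstart.yaml
kubectl apply -f capi-quickstart.yaml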

@CecileRobertMichon
Contributor

@devigned @lastcoolnameleft unfortunately, the current version of external-cloud-provider we're using in the example template in CAPZ v0.4 doesn't have the fix yet. The PR to bump the version (#1323) and enable the test that validates this behavior was blocked by another regression in cloud-provider, which is now released. You can work around it for now by editing your template to use version v0.7.4+ of cloud-provider until we update the reference template.
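
A rough sketch of that interim workaround (the image names below are assumptions about the external-cloud-provider flavor template, so verify them against your generated manifest before applying):

# bump the out-of-tree cloud provider images to v0.7.4 in the generated template
sed -i 's|azure-cloud-controller-manager:v[0-9.]*|azure-cloud-controller-manager:v0.7.4|' capi-quickstart.yaml
sed -i 's|azure-cloud-node-manager:v[0-9.]*|azure-cloud-node-manager:v0.7.4|' capi-quickstart.yaml
kubectl apply -f capi-quickstart.yaml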

The in-tree fix will be in k8s 1.22+.

Regarding using out-of-tree by default, v1.0.0 of the out-of-tree provider just got released, so it might be a good time to do that; tracking in #715.

@nader-ziada nader-ziada added this to the v0.5 milestone Aug 26, 2021
@CecileRobertMichon CecileRobertMichon modified the milestones: v0.5, next Oct 28, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 25, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shysank
Contributor

shysank commented Mar 28, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 28, 2022
@shysank shysank reopened this Mar 28, 2022
@CecileRobertMichon
Contributor

I think we can close this now. This was fixed in the external cloud provider v0.7.4+ and k8s 1.22+.

/close

@k8s-ci-robot
Contributor

@CecileRobertMichon: Closing this issue.

In response to this:

I think we can close this now. This was fixed in the external cloud provider v0.7.4+ and k8s 1.22+.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon CecileRobertMichon removed this from the next milestone May 4, 2023