If nodes are not in a ready state when the controller manager boots, the cache is stale, causing a 15 min delay of LB deployment #363
Possibly related to the cache read mode? https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go#L1196

VMSS does seem to force a refresh if the instances aren't found: https://github.com/kubernetes/kubernetes/blob/ec560b9737537be8c688776461bc700e8ddedb9d/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go#L276-L295
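To make the contrast concrete, here is a minimal Go sketch of a per-instance cache in the style the VMSS path is described as using above, where a lookup miss for a single instance forces a refresh. All names (`perInstanceCache`, `get`, `refresh`) are hypothetical illustrations, not the actual legacy-cloud-provider API.

```go
package main

import "fmt"

// perInstanceCache keys the cache by instance name, so a lookup for an
// unknown instance is itself a cache miss and triggers a refresh.
type perInstanceCache struct {
	entries map[string]string
	refresh func() map[string]string // reloads all instances from the backend
}

// get force-refreshes when the instance is absent, mirroring the VMSS
// behavior linked above: a node that joined after the last refresh is
// picked up on its first lookup.
func (c *perInstanceCache) get(name string) (string, bool) {
	if v, ok := c.entries[name]; ok {
		return v, true
	}
	c.entries = c.refresh() // instance not found: force refresh
	v, ok := c.entries[name]
	return v, ok
}

func main() {
	backend := map[string]string{} // no instances ready at boot
	c := &perInstanceCache{
		entries: map[string]string{},
		refresh: func() map[string]string {
			out := map[string]string{}
			for k, v := range backend {
				out[k] = v
			}
			return out
		},
	}

	_, ok := c.get("vmss-0") // miss, refresh, instance still absent
	fmt.Println(ok)          // false

	backend["vmss-0"] = "Running" // instance becomes ready later
	_, ok = c.get("vmss-0")       // miss triggers refresh, instance is found
	fmt.Println(ok)               // true
}
```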
For availability sets (AS), I believe it is because the nodes are stored in the cache as a single set. The whole set is cached, so a request for an individual instance doesn't count as a cache miss and doesn't trigger a refresh.
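A minimal Go sketch of that failure mode, assuming a whole-set cache as described above (the names `nodeSetCache`, `lister`, and `get` are hypothetical, not the actual legacy-cloud-provider code):

```go
package main

import "fmt"

// nodeSetCache mimics a cache that stores the entire set of VMAS nodes
// under one entry, refreshed only when the whole set has never been loaded.
type nodeSetCache struct {
	nodes  map[string]bool   // nil means "never populated"
	lister func() []string   // lists nodes from the backend
}

// get reports whether name is a known VMAS node. Because the whole set is
// cached, a lookup for a node that joined after the last refresh is NOT a
// cache miss: the set exists, so no refresh is triggered.
func (c *nodeSetCache) get(name string) bool {
	if c.nodes == nil { // only a missing set counts as a miss
		c.nodes = map[string]bool{}
		for _, n := range c.lister() {
			c.nodes[n] = true
		}
	}
	return c.nodes[name]
}

func main() {
	backend := []string{} // no nodes ready when the controller manager boots
	c := &nodeSetCache{lister: func() []string { return backend }}

	fmt.Println(c.get("node-0")) // false: set cached as empty

	backend = []string{"node-0"} // node becomes ready later
	fmt.Println(c.get("node-0")) // still false: stale set, no miss, no refresh
}
```

This is the shape of the 15 minute delay: the stale set only recovers when something else (TTL expiry, leader election) reloads the whole cache.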
@aramase @jsturtevant @alexeldeib do we have an idea of what is needed to fix this?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
Pretty sure this still exists... need to refresh my memory for a fix.
/assign
Looking into this.
So this fixes it: fcb1df0. The issue with that solution is that it causes a force refresh of the VM cache on every run. The previous reasoning behind that call no longer holds: in CAPZ we now have both VM and VMSS worker nodes joining and leaving the cluster with scale ups/downs, so when we check whether a node is a VM or VMSS we need to have the latest information. A better solution might be to trigger a cache refresh only when the list of nodes in the cluster has changed since the last time the cache was refreshed (need to look into how feasible that is, though). @aramase @alexeldeib @jsturtevant wdyt?
Here's an attempt at only doing a refresh of the VM cache when the node is new to the cluster: master...CecileRobertMichon:fix-vm-cache. This should eliminate most of the extra refreshes when the node is a known VMSS node, while still refreshing when the node is new to the cluster and we don't yet know whether it's a VM or VMSS. I haven't tested that it works yet, but that's the general idea.
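The idea in that branch can be sketched in a few lines of Go: track which nodes have been seen since the last refresh and only force a reload for nodes we have never seen. This is a hypothetical illustration of the approach, not the code in master...CecileRobertMichon:fix-vm-cache; `vmTypeResolver`, `known`, and `resolve` are invented names.

```go
package main

import "fmt"

// vmTypeResolver forces a VM-cache refresh only for nodes it has never
// seen, so known VMSS nodes don't pay the refresh cost on every lookup.
type vmTypeResolver struct {
	known   map[string]bool
	refresh func() // forces a reload of the VM cache
}

func (r *vmTypeResolver) resolve(node string) {
	if !r.known[node] {
		// Node is new to the cluster: we can't yet tell VM from VMSS,
		// so pay the cost of a forced refresh once.
		r.refresh()
		r.known[node] = true
		return
	}
	// Known node: serve from the existing cache, no forced refresh.
}

func main() {
	refreshes := 0
	r := &vmTypeResolver{known: map[string]bool{}, refresh: func() { refreshes++ }}

	r.resolve("node-0") // new node: forces one refresh
	r.resolve("node-0") // known node: no refresh
	r.resolve("node-1") // another new node: one more refresh

	fmt.Println(refreshes) // 2
}
```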
What happened:
When adding functionality to the e2e tests in kubernetes-sigs/cluster-api-provider-azure#857 for CAPZ, I found the e2e tests would fail when run against a single control plane node cluster. This would also occasionally happen when running multi-node control plane clusters. This was initially reported with some of the same messages in #338.
What happens is the controller manager comes online and queries the Azure machines for power status. Since the machines are not available yet, they are not in the cache. When the request for the load balancer comes in, the cache is queried and reports that the node does not exist as a VMAS node, so the code falls through to the VMSS path, hence the reported error message. When the node is found in the cache, it goes down the correct code path.
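The wrong-path dispatch described above can be reduced to a tiny Go sketch: when the (possibly stale) VMAS cache doesn't contain the node, the code assumes it is a VMSS instance. `vmTypeFor` and the string results are hypothetical stand-ins, not the legacy-cloud-provider API.

```go
package main

import "fmt"

// vmTypeFor sketches the faulty dispatch: absence from the VMAS cache is
// treated as "this must be a VMSS node", even when the cache is stale.
func vmTypeFor(node string, vmasCache map[string]bool) string {
	if vmasCache[node] {
		return "vmas" // correct path once the node is in the cache
	}
	return "vmss" // a VMAS node missing from a stale cache lands here
}

func main() {
	staleCache := map[string]bool{} // populated before the node was ready
	fmt.Println(vmTypeFor("control-plane-0", staleCache)) // vmss: wrong path

	freshCache := map[string]bool{"control-plane-0": true}
	fmt.Println(vmTypeFor("control-plane-0", freshCache)) // vmas: correct path
}
```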
I was able to confirm the cache was missing the node by adding some custom logs to the controller-manager. The initial request to the cache doesn't have the nodes:
You can see the request for power status here:
And the request for EnsureHostsInPool doesn't have the VMAS nodes in the cache:
What you expected to happen:
How to reproduce it:
Follow the instructions at kubernetes-sigs/cluster-api-provider-azure#857 or create a single node control plane cluster then after the control node is up, add a worker node.
Anything else we need to know?:
In multi control plane node clusters, leader election occurs in the controller-manager and causes the cache to reload. This is why it only happens occasionally in that scenario.
I believe this could also happen if a node is added at a later time and is expected to be in the backend pool but isn't, so this could be important to fix.
Environment:
- Kubernetes version (use `kubectl version`): 1.18.6
- Kernel (e.g. `uname -a`):