
Fix availability set cache in vmss cache #537

Merged
merged 2 commits on Mar 10, 2021

Conversation

CecileRobertMichon
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it: This fixes a bug affecting clusters that contain standalone virtual machines when vmType is set to "vmss". When the controller manager comes online, it queries the power status of the Azure machines. Since the machines are not available yet, they are not in the cache. When the load balancer request comes in, the cache is queried and reports that the node does not exist as a VMAS, so the provider attempts to run the VMSS code path, hence the following error message: failed: not a vmss instance. The same error also occurs when trying to expose a load balancer service. When the node is found in the cache, it goes down the correct code path.

This adds logic to the vmss cache to refresh the availability set cache if the node it is querying for has joined the cluster since the last cache refresh, or if the cache has expired. Note that this approach minimizes cache refreshes by not refreshing unless necessary, but it does add extra GET API calls whenever a new node joins the cluster. This can be skipped altogether with the DisableAvailabilitySetNodes flag that is already implemented. It's not perfect, but it will fix most scenarios for CAPZ. In the future, I'd like to propose a new "mixed" or "hybrid" VMSet implementation that is truly aware of both VMs and VMSS nodes as a more long-term fix.
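A minimal sketch of that refresh policy, for illustration only; the type and helper names below (availabilitySetNodeCache, its refresh function) are hypothetical stand-ins rather than the identifiers used in pkg/provider:

```go
package main

import (
	"fmt"
	"time"
)

// availabilitySetNodeCache is an illustrative stand-in for the cache of node
// names that are backed by standalone (availability set) VMs.
type availabilitySetNodeCache struct {
	nodes       map[string]bool                 // node names backed by availability set VMs
	lastRefresh time.Time                       // when the set was last rebuilt
	ttl         time.Duration                   // normal cache expiry
	refresh     func() (map[string]bool, error) // re-lists standalone VMs; an ARM GET/LIST in the real provider
}

// isAvailabilitySetNode refreshes the cached set when it has expired, or when
// the queried node joined the cluster after the last refresh and therefore
// cannot be in the cache yet. Otherwise the possibly-stale set is served as-is,
// which keeps the number of extra API calls down.
func (c *availabilitySetNodeCache) isAvailabilitySetNode(nodeName string, nodeJoined time.Time) (bool, error) {
	if time.Since(c.lastRefresh) > c.ttl || nodeJoined.After(c.lastRefresh) {
		fresh, err := c.refresh()
		if err != nil {
			return false, err
		}
		c.nodes = fresh
		c.lastRefresh = time.Now()
	}
	return c.nodes[nodeName], nil
}

func main() {
	cache := &availabilitySetNodeCache{
		nodes:       map[string]bool{"vmas-node-0": true},
		lastRefresh: time.Now().Add(-time.Minute),
		ttl:         10 * time.Minute,
		refresh: func() (map[string]bool, error) {
			// Simulates re-listing standalone VMs after vmas-node-1 joined the cluster.
			return map[string]bool{"vmas-node-0": true, "vmas-node-1": true}, nil
		},
	}
	joined := time.Now() // vmas-node-1 joined after the last refresh, so this forces a refresh
	ok, _ := cache.isAvailabilitySetNode("vmas-node-1", joined)
	fmt.Println(ok) // true
}
```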

Which issue(s) this PR fixes:

Fixes #363

Special notes for your reviewer: @feiskyer, if I want this fix to make it back into the in-tree provider, what's the best way to proceed?

Release note:

Fix availability set cache in vmss cache

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 5, 2021
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 5, 2021
@CecileRobertMichon
Contributor Author

/assign @aramase @jsturtevant @alexeldeib @feiskyer

@CecileRobertMichon
Contributor Author

Will fix/add unit tests shortly. Please feel free to review the code in the meantime.

@nilo19
Contributor

nilo19 commented Mar 7, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2021
@feiskyer
Member

feiskyer commented Mar 8, 2021

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 8, 2021
@@ -210,7 +211,7 @@ func (ss *scaleSet) getVmssVM(nodeName string, crt azcache.AzureCacheReadType) (

// GetPowerStatusByNodeName returns the power state of the specified node.
func (ss *scaleSet) GetPowerStatusByNodeName(name string) (powerState string, err error) {
managedByAS, err := ss.isNodeManagedByAvailabilitySet(name, azcache.CacheReadTypeUnsafe)
Member

This change would cause every not-found node to trigger a cache update and may cause issues when a large number of nodes are deleted or newly created. Have you done any validation of these scenarios?

Contributor Author

Correct. See my note in PR description:

Note that this approach minimizes cache refreshes by not refreshing unless necessary, but it does add extra GET API calls whenever a new node joins the cluster. This can be skipped altogether with the DisableAvailabilitySetNodes flag that is already implemented. It's not perfect but it will fix most scenarios for CAPZ. In the future, I'd like to propose a new "mixed" or "hybrid" VMSet implementation that is truly aware of both VMs and VMSS nodes as a more long-term fix.

We already do the same thing for VMSS: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/master/pkg/provider/azure_vmss.go#L182-L188

There isn't a way to avoid this, unfortunately, if the node just joined the cluster and we've never fetched its corresponding VM/VMSS.
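For context, the linked VMSS lookup follows a "serve from cache, force a refresh on miss" shape. A standalone sketch of that pattern, with illustrative names standing in for azcache.AzureCacheReadType and the real cache lookup:

```go
package main

import (
	"errors"
	"fmt"
)

// readType mirrors the idea of azcache.AzureCacheReadType; these constants are
// illustrative, not the provider's actual identifiers.
type readType int

const (
	readUnsafe       readType = iota // serve possibly-stale cached data
	readForceRefresh                 // re-list from the Azure API before answering
)

var errNotFound = errors.New("instance not found")

// getVMSSVM tries a cheap, possibly-stale lookup first and only pays for a
// forced refresh when the node is missing, e.g. because it just joined the
// cluster and its VM has never been fetched.
func getVMSSVM(lookup func(name string, crt readType) (string, error), name string) (string, error) {
	vm, err := lookup(name, readUnsafe)
	if errors.Is(err, errNotFound) {
		return lookup(name, readForceRefresh)
	}
	return vm, err
}

func main() {
	// Fake lookup: the cache only knows about vmss-node-0 until it is refreshed.
	cached := map[string]string{"vmss-node-0": "vmss/0"}
	lookup := func(name string, crt readType) (string, error) {
		if crt == readForceRefresh {
			cached["vmss-node-1"] = "vmss/1" // simulate re-listing the scale set
		}
		if vm, ok := cached[name]; ok {
			return vm, nil
		}
		return "", errNotFound
	}
	fmt.Println(getVMSSVM(lookup, "vmss-node-1")) // vmss/1 <nil> after the forced refresh
}
```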

Contributor

Since we are doing the refresh below based on the diff between new nodes and known VMs, I am not sure that this change is required? If it isn't in the cache but the node is new, then a refresh is forced by https://github.com/kubernetes-sigs/cloud-provider-azure/pull/537/files#diff-e6715e02b5aa5f8b693bf3e14dd99dd9a160f0f0f1481df96891af24fc3a5fffR333

Contributor

may cause issues when a large number of nodes are deleted or newly created

This will only be an issue if the nodes come up in sequence, right? I.e. if a node is added every 5 minutes, we would have an API call every five minutes.

Do we have any data on how often that type of scenario occurs? Maybe during node autoscaler operations?

Member

Yeah, I think we can keep the cache read type unsafe as we are doing a force refresh if not found in the cache.

Contributor Author

Yeah, I think we can keep the cache read type unsafe as we are doing a force refresh if not found in the cache.

Correct, reverted that change

Member

Thanks for the discussion and for addressing the issues.

@jsturtevant
Contributor

I think this does solve the 15-minute delay on initial boot, or when the nodes are fetched right before a scale event, and it mitigates the extra API calls by only refreshing when the node is new to the cluster.

A couple of thoughts for the future:

  • Do we have any tests or data that measure the number of API calls, so we can tell whether a code change significantly increases them?
  • Is there a label on the node that allows us to determine whether it is VMSS or VMAS? Is that something worth considering?

@aramase
Member

aramase commented Mar 9, 2021

Is there a label on the node that allows us to determine whether it is VMSS or VMAS? Is that something worth considering?

Typically this is determined from the ProviderID field on the node. But if the node has not yet been registered, then the field is empty.
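For reference, a node's ProviderID can be inspected to distinguish the two VM kinds. The sketch below is an illustrative helper (not part of the provider) that relies on the standard Azure provider ID formats and reports "unknown" for an unregistered node with an empty ProviderID:

```go
package main

import (
	"fmt"
	"strings"
)

// isVMSSProviderID is an illustrative check based on the Azure provider ID
// format: VMSS instances carry a ".../virtualMachineScaleSets/<vmss>/virtualMachines/<instance>"
// resource path, while standalone (availability set) VMs point directly at
// ".../virtualMachines/<name>". An empty ProviderID (node not yet registered)
// cannot be classified, as noted above.
func isVMSSProviderID(providerID string) (isVMSS, known bool) {
	if providerID == "" {
		return false, false
	}
	return strings.Contains(strings.ToLower(providerID), "/virtualmachinescalesets/"), true
}

func main() {
	vmss := "azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/agentpool/virtualMachines/0"
	vmas := "azure:///subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/control-plane-0"
	fmt.Println(isVMSSProviderID(vmss)) // true true
	fmt.Println(isVMSSProviderID(vmas)) // false true
	fmt.Println(isVMSSProviderID(""))   // false false
}
```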

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2021
@feiskyer (Member) left a comment

/hold cancel

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Mar 10, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2021
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.
