Fix stale replicas issue with cluster-autoscaler CAPI provider #3177

Merged
merged 4 commits into kubernetes:master
Jun 3, 2020

@elmiko elmiko commented Jun 2, 2020

This change brings in a series of patches to fix MachineSet and MachineDeployment replica counts becoming stale. The patches protect the deletion path in two ways: the operation is now guarded by a mutex, and the size check compares against the group's minimum size instead of against 0. In addition, replica counts are now read through API server calls so that replica checks always use the freshest available information. Lastly, the unit tests receive a minor change to account for the new API server queries for replicas.

fixes: #3104

/area provider/cluster-api

elmiko and others added 4 commits June 2, 2020 13:58
This change adds a mutex to the MachineController structure which is
used to gate access to the DeleteNodes function.

This is one in a series of PRs to mitigate kubernetes#3104
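
As an illustration of the commit above, here is a minimal sketch of gating a delete operation with a mutex. It is not the code from this PR: the `machineController` type, the `accessLock` field, and the `replicas` map are hypothetical stand-ins chosen to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// machineController is a hypothetical stand-in for the provider's controller.
// The accessLock field name and the replicas map are assumptions for illustration.
type machineController struct {
	accessLock sync.Mutex
	replicas   map[string]int
}

// deleteNodes removes n replicas from the named group. The mutex ensures that
// concurrent callers read and update the replica count one at a time.
func (c *machineController) deleteNodes(group string, n int) error {
	c.accessLock.Lock()
	defer c.accessLock.Unlock()

	cur, ok := c.replicas[group]
	if !ok {
		return fmt.Errorf("unknown node group %q", group)
	}
	if n > cur {
		return fmt.Errorf("cannot delete %d nodes from %q: only %d replicas", n, group, cur)
	}
	c.replicas[group] = cur - n
	return nil
}

func main() {
	c := &machineController{replicas: map[string]int{"workers": 10}}

	// Several goroutines delete one node each; the lock serializes them.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.deleteNodes("workers", 1)
		}()
	}
	wg.Wait()
	fmt.Println("replicas left:", c.replicas["workers"])
}
```

Holding the lock for the whole read-modify-write keeps two scale-down requests from both acting on the same observed replica count.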
When getting Replicas(), the local struct in the scalable resource might be stale. To mitigate possible side effects, we always want to get fresh replicas.

This is one in a series of PRs to mitigate kubernetes#3104
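
The idea in this commit can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: `replicaGetter`, `fakeAPI`, `scalableResource`, and `cachedReplicas` are hypothetical names, and the fake client stands in for a real API server call.

```go
package main

import "fmt"

// replicaGetter is a hypothetical interface standing in for an API server client.
type replicaGetter interface {
	GetReplicas(name string) (int32, error)
}

// fakeAPI simulates the API server's current view of each resource.
type fakeAPI struct {
	replicas map[string]int32
}

func (f *fakeAPI) GetReplicas(name string) (int32, error) {
	r, ok := f.replicas[name]
	if !ok {
		return 0, fmt.Errorf("scalable resource %q not found", name)
	}
	return r, nil
}

// scalableResource holds a local copy that may be stale (cachedReplicas)
// alongside a client that can fetch the current value.
type scalableResource struct {
	name           string
	cachedReplicas int32
	api            replicaGetter
}

// Replicas ignores the local copy and always asks the API for the current value.
func (s *scalableResource) Replicas() (int32, error) {
	return s.api.GetReplicas(s.name)
}

func main() {
	api := &fakeAPI{replicas: map[string]int32{"md-0": 5}}
	res := &scalableResource{name: "md-0", cachedReplicas: 3, api: api}

	fresh, err := res.Replicas()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cached:", res.cachedReplicas, "fresh:", fresh)
}
```

The trade-off is an extra API call per replica check in exchange for not acting on an outdated count.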
provider

When calling deleteNodes() we should fail early if the operation could delete nodes below the nodeGroup minSize().

This is one in a series of PRs to mitigate kubernetes#3104
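
A minimal sketch of the fail-early check described in this commit, assuming a simplified `nodeGroup` type with hypothetical `minSize` and `replicas` fields. The real provider works with Machine objects and API calls; this only shows the comparison against the minimum size instead of 0.

```go
package main

import "fmt"

// nodeGroup is a hypothetical, simplified node group used only for illustration.
type nodeGroup struct {
	minSize  int
	replicas int
}

// deleteNodes refuses the request up front if it would shrink the group
// below minSize, rather than only guarding against dropping below zero.
func (g *nodeGroup) deleteNodes(count int) error {
	if count <= 0 {
		return fmt.Errorf("count must be positive, got %d", count)
	}
	if g.replicas-count < g.minSize {
		return fmt.Errorf("deleting %d nodes would leave %d replicas, below minSize %d",
			count, g.replicas-count, g.minSize)
	}
	g.replicas -= count
	return nil
}

func main() {
	g := &nodeGroup{minSize: 2, replicas: 3}

	// Allowed: 3 - 1 = 2, which is not below minSize.
	fmt.Println("first delete error:", g.deleteNodes(1))

	// Rejected: 2 - 1 = 1 would be below minSize, so nothing changes.
	fmt.Println("second delete error:", g.deleteNodes(1))
	fmt.Println("replicas:", g.replicas)
}
```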
@k8s-ci-robot k8s-ci-robot added area/provider/cluster-api Issues or PRs related to Cluster API provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 2, 2020
@k8s-ci-robot k8s-ci-robot requested review from hardikdr and ncdc June 2, 2020 18:57

elmiko commented Jun 2, 2020

cc @enxebre @JoelSpeed @detiber

@detiber detiber left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2020

enxebre commented Jun 3, 2020

thanks!
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2020
@k8s-ci-robot k8s-ci-robot merged commit 1ae89b9 into kubernetes:master Jun 3, 2020
@elmiko elmiko deleted the issue/3104 branch June 3, 2020 17:55
k8s-ci-robot added a commit that referenced this pull request Jul 29, 2020
[CA-1.18] #3177 cherry-pick: Fix stale replicas issue with cluster-autoscaler CAPI provider
benmoss pushed a commit to benmoss/autoscaler that referenced this pull request Sep 28, 2020
[CA-1.18] kubernetes#3177 cherry-pick: Fix stale replicas issue with cluster-autoscaler CAPI provider
colin-welch pushed a commit to Paperspace/autoscaler that referenced this pull request Mar 5, 2021
[CA-1.18] kubernetes#3177 cherry-pick: Fix stale replicas issue with cluster-autoscaler CAPI provider
Development

Successfully merging this pull request may close these issues.

capi: Replicas can go stale in nodeGroup causing erratic behaviour