🐛 Include machines in deleting status when calculate machineset replicas #3434

Merged
merged 1 commit into from
Dec 15, 2020

Conversation

jzhoucliqr
Contributor

What this PR does / why we need it:
This PR fixes the issue that a MachineDeployment RollingUpdate may launch more machines than (MD.Spec.Replicas + maxSurge). It includes machines in deleting status when calculating MachineSet status replicas.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3417
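
To make the bound above concrete, here is a minimal sketch of the surge arithmetic. It assumes a simplified model, not the actual Cluster API code: the function `allowedScaleUp` and its parameters are hypothetical names chosen for illustration. It shows why a machine that is still being deleted has to be counted in the current total.

```go
package main

import "fmt"

// allowedScaleUp is a hypothetical helper (not part of Cluster API).
// It returns how many machines may still be created without the total
// exceeding desired + maxSurge, given the current machine count.
func allowedScaleUp(desired, maxSurge, current int32) int32 {
	maxTotal := desired + maxSurge
	if current >= maxTotal {
		return 0
	}
	return maxTotal - current
}

func main() {
	// Assumed scenario: desired=3, maxSurge=1, so the cap is 4.
	// Four machines exist, one of which is still being deleted.

	// If the deleting machine is left out of the count, one more machine
	// appears to be allowed, which would bring the real total to 5.
	fmt.Println(allowedScaleUp(3, 1, 3)) // prints 1

	// If the deleting machine is counted, no scale-up is allowed.
	fmt.Println(allowedScaleUp(3, 1, 4)) // prints 0
}
```

In this model, counting the deleting machine is what keeps the real total at or below the cap of 4.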

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 30, 2020
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 30, 2020
@k8s-ci-robot
Contributor

Hi @jzhoucliqr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 30, 2020
@jzhoucliqr jzhoucliqr changed the title 🐛 include machines in deleting status when calculate machineset replicas 🐛 Include machines in deleting status when calculate machineset replicas Jul 30, 2020
controllers/machineset_controller.go (outdated)
@@ -521,7 +536,7 @@ func NewMSNewReplicas(deployment *clusterv1.MachineDeployment, allMSs []*cluster
 		return 0, err
 	}
 	// Find the total number of machines
-	currentMachineCount := GetReplicaCountForMachineSets(allMSs)
+	currentMachineCount := GetTotalReplicaCountForMachineSets(allMSs)
Member

If we can assume that a MachineSet's status.replicas includes deleting or not yet ready Machines, then it seems like it should be okay to use GetActualReplicaCountForMachineSets rather than needing to take the max here.

Contributor Author

The max covers the case of two MachineSets where one has spec=0, status=2 and the other has spec=1, status=0. At that moment the total count should be 3 rather than 2.
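
The example in this reply can be checked with a small sketch. It assumes a stripped-down stand-in type, `machineSet`, with just the two counters being discussed. The type and the function name `sumOfMaxSpecAndStatus` are hypothetical and are not the code in this PR.

```go
package main

import "fmt"

// machineSet is a hypothetical stand-in for clusterv1.MachineSet,
// reduced to the two replica counters discussed in this thread.
type machineSet struct {
	specReplicas   int32
	statusReplicas int32
}

// sumOfMaxSpecAndStatus adds up max(spec, status) for each MachineSet.
func sumOfMaxSpecAndStatus(sets []machineSet) int32 {
	var total int32
	for _, ms := range sets {
		if ms.specReplicas > ms.statusReplicas {
			total += ms.specReplicas
		} else {
			total += ms.statusReplicas
		}
	}
	return total
}

func main() {
	sets := []machineSet{
		{specReplicas: 0, statusReplicas: 2}, // scaled down, machines not gone yet
		{specReplicas: 1, statusReplicas: 0}, // scaled up, machine not created yet
	}
	fmt.Println(sumOfMaxSpecAndStatus(sets)) // prints 3
}
```

Summing only spec would give 1 and summing only status would give 2. Taking the per-MachineSet max gives 3, which counts both the machines still present and the one about to be created.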

Member

Gotcha. I'm wondering whether we need to avoid increasing the spec for the new MachineSet until the old MachineSet has finished stabilizing, if increasing it would exceed the max surge. I guess I'm also wondering whether this is the correct place to do that enforcement, especially since using the max of spec.replicas and status.replicas seems to be more of an approximation of doing the right thing than an assurance that we are doing the right thing.

Contributor Author

I'm trying to understand this. Do you mean we could allow MS.spec to exceed max surge here, and enforce the limit within the MS controller so that it won't actually create the Machine object if the surge is already exceeded? If this understanding is correct, I'm not sure how the MS controller would be able to do that, because maxSurge and the total number of replicas are only available in the MD.

Member

I think Jason was asking whether we need a check in the webhook validator to ensure that the status is up to date before allowing any type of resize. Probably worth a different issue though.

@ncdc
Contributor

ncdc commented Jul 30, 2020

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 30, 2020
@jzhoucliqr
Contributor Author

/test pull-cluster-api-verify

@jzhoucliqr
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 18, 2020
@jzhoucliqr
Contributor Author

/retest

// Use max(spec.Replicas,status.Replicas) to cover the cases that:
// 1. Scale up, where spec.Replicas increased but no machine created yet, so spec.Replicas > status.Replicas
// 2. Scale down, where spec.Replicas decreased but machine not deleted yet, so spec.Replicas < status.Replicas
func GetSumOfMaxOfSpecAndStatusReplicaCountForMachineSets(machineSets []*clusterv1.MachineSet) int32 {
Contributor Author

does this make it better or worse ..

@jzhoucliqr jzhoucliqr force-pushed the ms-rep-count branch 2 times, most recently from c899fc4 to babff3f Compare August 20, 2020 23:24
@vincepri
Member

LGTM pending squash

@jzhoucliqr
Contributor Author

Thanks a lot @vincepri

Member

@vincepri vincepri left a comment

/approve
/assign @ncdc @fabriziopandini

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 21, 2020
@vincepri
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 28, 2020
@vincepri
Member

/retest

@vincepri
Member

/assign @ncdc @detiber

@vincepri
Member

/milestone v0.3.10

@k8s-ci-robot k8s-ci-robot added this to the v0.3.10 milestone Sep 29, 2020
@vincepri
Member

Moving this to v0.4

/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.10, v0.4.0 Sep 30, 2020
@jzhoucliqr
Contributor Author

@vincepri do you think we can merge this into v0.3.11? There is no API change, only a minor behavior change.
I'll resolve the conflicts.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 29, 2020
@vincepri
Member

vincepri commented Nov 9, 2020

@jzhoucliqr Apologies, I missed your message. Let's rebase and merge for v0.4.0 first and get another round of review. If the fix is appropriate, we should backport it to release-0.3 this week.

@vincepri
Member

vincepri commented Nov 9, 2020

/assign

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 10, 2020
@jzhoucliqr
Contributor Author

Thanks @vincepri. Rebase done.

Contributor

@ncdc ncdc left a comment

These comments are super minor nits that are non-blocking if we'd like to get this in without additional revisions, given the PR has been sitting for almost a month (so sorry about this!!!). @detiber did you get all your questions/comments answered & are you ok to proceed with this PR?

@jzhoucliqr
Contributor Author

/retest

@vincepri
Member

/lgtm

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MD RollingUpdate support to only add new node after one old node deleted completely
6 participants