🏃 Remove annotations on upgradeControlPlane #2887
Conversation
The last line of this PR's description links to my PR rather than to an issue, and it is going to close my PR. Please update the first comment to point to the correct issue.
I updated the reference |
/assign @fabriziopandini @sedefsavas @yastij for extra review 👀 |
/milestone v0.3.4 |
Sorry about that, not sure how I messed that up |
/hold |
/hold cancel |
Chatted yesterday with @benmoss, this PR is only a first step in removing the annotations. We'll follow up with more changes, more specifically:
- Add the concept of `outdated` machines in the filters.
- Make `scaleDownControlPlane` take only the `ownedMachines` as input, and decide on its own which Machine should be scaled down.
- Picking a Machine for scale down will happen in `selectAndMarkMachine` (which should probably be renamed).
- Modify the logic to first look at the outdated Machines. If there are any outdated Machines, use this list as the collection.
- If there are no outdated Machines, pick the oldest one in the FailureDomain with the most machines.
Going to update this to remove all annotations, there's also a bug right now with selectedMachines / ownedMachines in scale down |
Upgrade logic no longer uses machine annotations. Logic is now:
- Check number of nodes in workload cluster
- If node count <= replicas, create new upgrade machine
- If node count > replicas, scale down

Scale up logic ensures that we don't create additional machines if we reconcile while waiting for an upgrade machine to appear in the node list.

Scale up should only consider machines needing upgrade. We never support upgrading a subset of the cluster, but this will ensure that we pick the FD that has the most machines needing upgrade, rather than just the FD with the most machines. Also add a comment to explain why scale up will not cause more than 1 machine to be created.

Scale down always scales down outdated machines first. This removes the need to pass through outdated machines.
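The decision described above boils down to a single comparison between node count and desired replicas. A minimal sketch with a hypothetical `decideUpgradeStep` helper (not the real reconciler code):

```go
package main

import "fmt"

// decideUpgradeStep mirrors the described upgrade loop: while the workload
// cluster has no more nodes than desired replicas, add an upgraded machine;
// once it has more nodes than replicas, scale an outdated machine down.
func decideUpgradeStep(nodeCount, replicas int32) string {
	if nodeCount <= replicas {
		return "scale up"
	}
	return "scale down"
}

func main() {
	fmt.Println(decideUpgradeStep(3, 3)) // new upgrade machine hasn't joined yet
	fmt.Println(decideUpgradeStep(4, 3)) // surplus node: remove an outdated machine
}
```

Because the node count only grows once a new machine actually joins the cluster, re-reconciling while waiting still hits the `<=` branch and does not create extra machines, which is the guard the commit message describes.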
/hold cancel |
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: benmoss, vincepri.
@benmoss looks great!
Only a few questions from my side to double-check some code steps.
```go
	return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: healthCheckFailedRequeueAfter}
}
if err := workloadCluster.RemoveMachineFromKubeadmConfigMap(ctx, machineToDelete); err != nil {
```
Just to double-check this: if there is an error after removing the machine from the config map but before deleting it, when re-entering this code `machineToDelete` will be the same, right?
I don't think that can be guaranteed, since a user could encounter an error and change their cluster such that the next reconciliation doesn't retry scale down. At that point any number of changes can happen before the next scale down (like more scale up) such that the partially removed machine isn't the same one removed.
```go
func (r *KubeadmControlPlaneReconciler) selectAndMarkMachine(ctx context.Context, machines internal.FilterableMachineCollection, annotation string, controlPlane *internal.ControlPlane) (*clusterv1.Machine, error) {
	selected, err := controlPlane.MachineInFailureDomainWithMostMachines(machines)
```

```go
status, err := workloadCluster.ClusterStatus(ctx)
```
I know this is not part of this PR, but ClusterStatus confused me.
It is ControlPlane status (workers are not considered)...
No I think this is absolutely true, I can't believe this didn't occur to me earlier
Oh, nevermind! I thought your point was that I am looking at all nodes, and not just control plane nodes, and this was a huge bug 🤦 . But in fact it's that this method is only for control plane nodes, even though it seems more like it should be for all nodes.
```go
if needingUpgrade := controlPlane.MachinesNeedingUpgrade(); needingUpgrade.Len() > 0 {
	machines = needingUpgrade
}
return controlPlane.MachineInFailureDomainWithMostMachines(machines)
```
If I'm not wrong, in the case of upgrades this is considering only the machines to upgrade, not all the control-plane machines.
Is it possible this leads to an unexpected machine distribution? Probably this is compensated by scale-up always looking at all the machines for placement...
I don't think it will change the machine distribution. We scale down the FD with the most un-upgraded machines, but we scale back up by smallest FD overall
Here's a run-through of what I understand the algorithm to be: https://gist.github.com/benmoss/9914fb4e09e1e4fed4a651119e983298
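That asymmetry can be illustrated with two toy helpers (hypothetical names, not the cluster-api API): scale down keys off the failure domain with the most un-upgraded machines, while scale up places into the smallest failure domain overall.

```go
package main

import "fmt"

// fdWithMost returns the failure domain with the highest count; the
// sketch for scale down feeds it only the un-upgraded machine counts.
func fdWithMost(counts map[string]int) string {
	best, bestN := "", -1
	for fd, n := range counts {
		if n > bestN {
			best, bestN = fd, n
		}
	}
	return best
}

// fdWithFewest returns the failure domain with the lowest count; the
// sketch for scale up feeds it the counts of ALL machines.
func fdWithFewest(counts map[string]int) string {
	best, bestN := "", int(^uint(0)>>1) // max int
	for fd, n := range counts {
		if n < bestN {
			best, bestN = fd, n
		}
	}
	return best
}

func main() {
	outdatedPerFD := map[string]int{"a": 2, "b": 1}         // un-upgraded machines only
	allPerFD := map[string]int{"a": 2, "b": 2, "c": 1}      // every control plane machine
	fmt.Println(fdWithMost(outdatedPerFD))  // domain chosen for scale down: "a"
	fmt.Println(fdWithFewest(allPerFD))     // domain chosen for scale up: "c"
}
```

The two functions consult different inputs, which is why removing from the "most outdated" domain and adding to the "smallest overall" domain can still converge on a balanced distribution.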
```go
	}
	return selected, nil
```

```go
if status.Nodes <= *kcp.Spec.Replicas {
	// scaleUp ensures that we don't continue scaling up while waiting for Machines to have NodeRefs
```
Just for my knowledge, where is this logic? I can't find it...
/hold |
/hold cancel |
/test pull-cluster-api-apidiff |
@benmoss: The following test failed. |
API changes look good, it's all internal |
@benmoss thanks for your answers! |
What this PR does / why we need it:
Removes use of annotations in upgrade logic. We were duplicating the machine-selection logic in the scale down code, and the annotations also cause bugs like #2430, where the cached state stored in the annotation gets out of sync with the world.
This logic depends on `scaleUp`/`TargetClusterControlPlaneIsHealthy` ensuring a 1:1 correspondence of nodes and machines, so we don't continue scaling up while new machines are still joining the cluster.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #2702