
First pass for machine deployment controller #143

Merged
1 commit merged into kubernetes-sigs:master on May 31, 2018

Conversation

Contributor

@k4leung4 commented May 8, 2018

What this PR does / why we need it:
This is the first pass for the machine deployment controller.

Copies/borrows heavily from deployment controller.

Things not implemented in this PR:

  • deployment progress
  • orphan adoption

Caveat

  • rolling update does not work because MachineSet status is not implemented

@kubernetes/kube-deploy-reviewers

@k8s-ci-robot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) label on May 8, 2018
@k8s-ci-robot
Contributor

Hi @k4leung4. Thanks for your PR.

I'm waiting for a kubernetes or kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) labels on May 8, 2018
@roberthbailey
Contributor

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.) label on May 9, 2018
@k4leung4 force-pushed the machinedeployment branch 5 times, most recently from 547e422 to c3206b8 on May 9, 2018 19:50
@k4leung4
Contributor Author

/assign @krousey

@k4leung4 force-pushed the machinedeployment branch 2 times, most recently from bdbcf1d to 8ee2058 on May 15, 2018 17:15
@krousey
Contributor

krousey commented May 15, 2018

/assign @maisem

@k8s-ci-robot
Contributor

@krousey: GitHub didn't allow me to assign the following users: maisem.

Note that only kubernetes-sigs members and repo collaborators can be assigned.

In response to this:

/assign @maisem

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@dgoodwin left a comment

Attempted a review, though I am mostly just learning how this works. Added some questions and suggestions for readability as a result.

This looks awesome and we're really looking forward to trying it out. I'd love to test it with our code, but we aren't quite ready yet; we will try to do so ASAP.

Thanks!

return nil, err
}

// TODO: flesh out machine set adoption.
Contributor

Out of curiosity, is MachineSet adoption a common use case? Curious how this would typically be used.

Contributor Author

For pod Deployments and ReplicaSets, the use case is that a user created a ReplicaSet by hand and wants to upgrade it, so they create a Deployment with matching labels; the Deployment can then adopt the ReplicaSet and upgrade it using deployment strategies.

For machine deployments, it would be the equivalent: someone creates a MachineSet by hand and wants to upgrade it using machine deployment strategies.
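
For illustration, a minimal sketch of that adoption check (not part of this PR; `couldAdopt` and the example values are hypothetical, only the apimachinery helpers are real): a deployment would only be able to adopt an orphaned MachineSet whose labels match its selector.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// couldAdopt is a hypothetical helper: a MachineDeployment may adopt a
// MachineSet only if the MachineSet is an orphan (it has no controller
// reference) and the deployment's selector matches the MachineSet's labels.
func couldAdopt(selectorSpec *metav1.LabelSelector, ms metav1.Object) bool {
	if metav1.GetControllerOf(ms) != nil {
		return false // already owned by some controller
	}
	selector, err := metav1.LabelSelectorAsSelector(selectorSpec)
	if err != nil || selector.Empty() {
		return false // an invalid or empty selector should match nothing
	}
	return selector.Matches(labels.Set(ms.GetLabels()))
}

func main() {
	// A hand-created MachineSet with no owner, as in the scenario above.
	ms := &metav1.ObjectMeta{Name: "hand-made", Labels: map[string]string{"set": "node"}}
	sel := &metav1.LabelSelector{MatchLabels: map[string]string{"set": "node"}}
	fmt.Println(couldAdopt(sel, ms)) // true: orphaned, and the labels match
}
```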

}

// updateMachineSet figures out what deployment(s) manage a MachineSet when the MachineSet
// is updated and wake them up. If the anything of the MachineSets have changed, we need to
Contributor

A couple of suggestions for godoc clarity; would this be accurate?

"If anything on the MachineSet has changed, we need to reconcile its current MachineDeployment. If the MachineSet's controller reference has changed, we must also reconcile its old MachineDeployment."

Contributor Author

That is correct. Updated the godoc.
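
As a sketch of the flow that godoc describes, modeled on the upstream deployment controller (the `enqueue` callback and the stand-in types are assumptions): the current owner is always reconciled, and if the controller reference changed, the previous owner is woken up too.

```go
package main

import (
	"fmt"
	"reflect"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// onMachineSetUpdate sketches the godoc's contract. enqueue stands in for
// whatever queues a MachineDeployment for reconciliation.
func onMachineSetUpdate(old, cur metav1.Object, enqueue func(deploymentName string)) {
	oldRef := metav1.GetControllerOf(old)
	curRef := metav1.GetControllerOf(cur)
	if !reflect.DeepEqual(oldRef, curRef) && oldRef != nil {
		// The controller reference changed: reconcile the previous owner so
		// it can react to losing this MachineSet.
		enqueue(oldRef.Name)
	}
	if curRef != nil {
		// Wake up the deployment that currently owns the MachineSet.
		enqueue(curRef.Name)
	}
}

func main() {
	isController := true
	ref := func(name string) []metav1.OwnerReference {
		return []metav1.OwnerReference{{Kind: "MachineDeployment", Name: name, Controller: &isController}}
	}
	old := &metav1.ObjectMeta{Name: "ms", OwnerReferences: ref("md-old")}
	cur := &metav1.ObjectMeta{Name: "ms", OwnerReferences: ref("md-new")}
	onMachineSetUpdate(old, cur, func(name string) { fmt.Println("enqueue", name) }) // md-old, then md-new
}
```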

}

// FindNewMachineSet returns the new MS this given deployment targets (the one with the same machine template).
func FindNewMachineSet(deployment *v1alpha1.MachineDeployment, msList []*v1alpha1.MachineSet) *v1alpha1.MachineSet {
Contributor

Would "FindCurrentMachineSet" be a better name for this function?

Contributor Author

Not sure if "current" is really better. IMHO, it all comes down to terminology.
"new" refers to the new machine set that we are upgrading to, and "old" refers to the old machine set that we are upgrading from.
In a sense, "current" runs into the same issue: both the machine set we are upgrading from and the one we are upgrading to are current.
I am all for better naming, as I too had difficulty following the deployment controller code this code is based on, though I am not sure renaming it to "current" helps much.

Contributor

That sounds OK; "new" is pervasive and used consistently here, so this works for me now.

// In rare cases, such as after cluster upgrades, Deployment may end up with
// having more than one new MachineSets that have the same template as its template,
// see https://github.com/kubernetes/kubernetes/issues/40415
// We deterministically choose the oldest new MachineSet.
Contributor

Suggest "oldest MachineSet with matching template hash."

Contributor Author

done

for i := range msList {
if EqualIgnoreHash(&msList[i].Spec.Template, &deployment.Spec.Template) {
// In rare cases, such as after cluster upgrades, Deployment may end up with
// having more than one new MachineSets that have the same template as its template,
Contributor

Suggest dropping "as its template".

Contributor Author

done

// having more than one new MachineSets that have the same template as its template,
// see https://github.com/kubernetes/kubernetes/issues/40415
// We deterministically choose the oldest new MachineSet.
return msList[i]
Contributor

Could you clarify why it's correct to choose the oldest matching MachineSet? I'm not familiar with the specifics of how we end up with two (or how they differ), but it strikes me that in that situation you might want the newest MachineSet with a matching template.

Contributor Author

I don't think choosing the oldest matching MachineSet is always the correct choice; I too believe there are scenarios where it is correct to choose the newest one.
The oldest is chosen here so the behavior is deterministic, instead of a random behavior that depends on the ordering of the list.
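
A toy illustration of that determinism (stand-in types; the real code compares machine templates with EqualIgnoreHash): sorting candidates by creation timestamp before taking the first match makes the result independent of list order.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// machineSet is a pared-down stand-in for v1alpha1.MachineSet.
type machineSet struct {
	Name     string
	Created  time.Time
	Template string // stand-in for the machine template with hash labels stripped
}

// oldestMatching returns the oldest MachineSet whose template equals the
// deployment's, so ties between template-identical MachineSets resolve
// deterministically rather than by input ordering.
func oldestMatching(deploymentTemplate string, msList []*machineSet) *machineSet {
	sort.Slice(msList, func(i, j int) bool { return msList[i].Created.Before(msList[j].Created) })
	for _, ms := range msList {
		if ms.Template == deploymentTemplate {
			return ms
		}
	}
	return nil
}

func main() {
	now := time.Now()
	msList := []*machineSet{
		{Name: "ms-b", Created: now, Template: "t1"},
		{Name: "ms-a", Created: now.Add(-time.Hour), Template: "t1"}, // an older duplicate
	}
	fmt.Println(oldestMatching("t1", msList).Name) // always ms-a, whatever the input order
}
```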

)

// syncStatusOnly only updates Deployments Status and doesn't take any mutating actions.
func (dc *MachineDeploymentControllerImpl) syncStatusOnly(d *v1alpha1.MachineDeployment, msList []*v1alpha1.MachineSet, machineMap map[types.UID]*v1alpha1.MachineList) error {
Contributor

Possible unused function here.

Contributor Author

You are correct; removing the unused code.

}

// FindOldMachineSets returns the old machine sets targeted by the given Deployment, with the given slice of MSes.
// Note that the first set of old machine sets doesn't include the ones with no machines, and the second set of old machine sets include all old machine sets.
Contributor

Suggest rewording to say this returns a list of all old machine sets with replicas scaled to 0, and a second list that includes all old machine sets regardless of replicas.

Contributor Author

done.
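
A small sketch of the contract as the original godoc states it (stand-in types; the real function works on v1alpha1.MachineSet): everything other than the new MachineSet is old, and the function returns the old sets that still have machines plus a second list of all old sets.

```go
package main

import "fmt"

// machineSet is a minimal stand-in for v1alpha1.MachineSet.
type machineSet struct {
	Name     string
	Replicas int32
}

// findOldMachineSets returns the old MachineSets that still have machines,
// and a second list of all old MachineSets regardless of replica count.
func findOldMachineSets(newMS *machineSet, all []*machineSet) (withMachines, allOld []*machineSet) {
	for _, ms := range all {
		if ms.Name == newMS.Name {
			continue // skip the new MachineSet itself
		}
		allOld = append(allOld, ms)
		if ms.Replicas > 0 {
			withMachines = append(withMachines, ms)
		}
	}
	return withMachines, allOld
}

func main() {
	newMS := &machineSet{Name: "ms-new", Replicas: 3}
	all := []*machineSet{newMS, {Name: "ms-old-1", Replicas: 2}, {Name: "ms-old-0", Replicas: 0}}
	withMachines, allOld := findOldMachineSets(newMS, all)
	fmt.Println(len(withMachines), len(allOld)) // 1 2
}
```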

// rolling deployment.
if dutil.IsRollingUpdate(deployment) {
allMSs := dutil.FilterActiveMachineSets(append(oldMSs, newMS))
allMSsReplicas := dutil.GetReplicaCountForMachineSets(allMSs)
Contributor

Suggest a rename to "totalMSReplicas".

Contributor Author

done


// calculateStatus calculates the latest status for the provided deployment by looking into the provided machine sets.
func calculateStatus(allMSs []*v1alpha1.MachineSet, newMS *v1alpha1.MachineSet, deployment *v1alpha1.MachineDeployment) v1alpha1.MachineDeploymentStatus {
availableReplicas := dutil.GetAvailableReplicaCountForMachineSets(allMSs)
Contributor

The PR mentioned MachineSet Status is not implemented yet. Is this still accurate, with the code just prepared for when AvailableReplicas is properly set?

Contributor Author

This is still accurate, and you are correct.
The MachineSet Status has the field for AvailableReplicas, but the MachineSet controller currently does not populate it.
The method pulls the value out of the MachineSet Status, but since it isn't populated, it always returns 0, the default.
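
To make that dependency concrete, a stand-in sketch of the aggregation (simplified types): the deployment's available count is just a sum over MachineSet statuses, so it stays at the zero default until the MachineSet controller fills the field in.

```go
package main

import "fmt"

// machineSetStatus is a stand-in for v1alpha1.MachineSetStatus.
type machineSetStatus struct {
	AvailableReplicas int32
}

// availableReplicaCount only aggregates status; unpopulated statuses sum to 0.
func availableReplicaCount(statuses []machineSetStatus) int32 {
	var total int32
	for _, s := range statuses {
		total += s.AvailableReplicas // zero until the MachineSet controller sets it
	}
	return total
}

func main() {
	// Three MachineSets whose controller hasn't populated AvailableReplicas:
	fmt.Println(availableReplicaCount(make([]machineSetStatus, 3))) // 0
}
```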

@k4leung4 force-pushed the machinedeployment branch 2 times, most recently from a161b75 to 4ff2e87 on May 16, 2018 16:55
return
}
glog.V(4).Infof("MachineSet %s added for deployment %v.", ms.Name, d.Name)
c.Reconcile(d)
Contributor

Instead of calling Reconcile directly, you should queue the MachineDeployment. Otherwise, you could have Reconcile called simultaneously on the same MachineDeployment. (Same applies to all calls to Reconcile from event handlers)


nvm, this is for the case when you receive a MachineSet event and not for a MachineDeployment event. Maybe we should use AdditionalInformers.

Contributor

You can get access to the queue in your Init function with:

queue := arguments.GetSharedInformers().WorkerQueues["MachineDeployment"].Queue

Contributor Author

Thanks Cesar, I found the same thing.
I updated the PR to add to the queue rather than calling Reconcile directly.
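
A rough sketch of the resulting pattern, using client-go's workqueue and key function (the queue would come from the WorkerQueues lookup shown above; the helper name is hypothetical): event handlers compute the owning MachineDeployment's namespace/name key and add it to the queue, and since a workqueue never hands the same key to two workers at once, Reconcile is serialized per deployment.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// enqueueMachineDeployment adds the deployment's namespace/name key to the
// controller's queue instead of calling Reconcile inline.
func enqueueMachineDeployment(queue workqueue.Interface, obj interface{}) {
	key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
	if err != nil {
		fmt.Printf("could not get key for object %+v: %v\n", obj, err)
		return
	}
	queue.Add(key)
}

func main() {
	queue := workqueue.New()
	d := &metav1.ObjectMeta{Namespace: "default", Name: "my-machinedeployment"}
	enqueueMachineDeployment(queue, d)
	key, _ := queue.Get() // a worker would Reconcile this key, then call queue.Done(key)
	fmt.Println(key)      // default/my-machinedeployment
}
```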

@k4leung4 force-pushed the machinedeployment branch from 4ff2e87 to 0b69744 on May 17, 2018 15:22
@k4leung4
Contributor Author

/assign @justinsb

@k4leung4
Contributor Author

/cc @justinsb

@k8s-ci-robot requested a review from justinsb on May 21, 2018 22:28

var filteredMS []*v1alpha1.MachineSet
for _, ms := range msList {
if metav1.GetControllerOf(ms) != nil && !metav1.IsControlledBy(ms, d) {
Contributor

If a MS doesn't have a controller, we want to ignore it, I think? So: if metav1.GetControllerOf(ms) == nil || ...

Contributor Author

Nice catch. I was originally thinking of handling it when I added the adoption code, but it makes sense to ignore it for now.
Fixed.

}
selector, err := metav1.LabelSelectorAsSelector(&d.Spec.Selector)
if err != nil {
continue
Contributor

Does this happen? If so, maybe log here to save our future selves.

Contributor Author

This should never happen, as validation of the object should have ensured a valid label selector, but logging an error makes sense.
Added.

}
// If a deployment with a nil or empty selector creeps in, it should match nothing, not everything.
if selector.Empty() || !selector.Matches(labels.Set(ms.Labels)) {
continue
Contributor

Maybe a warning in the empty case also?

Contributor Author

Added better logging whenever a MachineSet is skipped.
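
Pulling the three threads above together, a consolidated sketch of the filter loop after those fixes (a sketch only; the v1alpha1 import path is assumed from this PR's tree, and the exact log lines are illustrative): orphans are skipped until adoption lands, an invalid selector is logged even though validation should prevent it, and an empty selector matches nothing.

```go
package machinedeployment

import (
	"github.com/golang/glog"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/cluster-api/pkg/apis/cluster/v1alpha1"
)

// filterMachineSets keeps only the MachineSets owned by d whose labels match
// d's selector, logging whenever a MachineSet is skipped.
func filterMachineSets(d *v1alpha1.MachineDeployment, msList []*v1alpha1.MachineSet) []*v1alpha1.MachineSet {
	var filtered []*v1alpha1.MachineSet
	for _, ms := range msList {
		// Skip MachineSets owned by another controller, and (until adoption is
		// implemented) orphans with no controller reference at all.
		if metav1.GetControllerOf(ms) == nil || !metav1.IsControlledBy(ms, d) {
			glog.V(4).Infof("Skipping MachineSet %v: not controlled by deployment %v.", ms.Name, d.Name)
			continue
		}
		selector, err := metav1.LabelSelectorAsSelector(&d.Spec.Selector)
		if err != nil {
			// Should never happen: object validation guarantees a parseable selector.
			glog.Errorf("Skipping MachineSet %v: invalid selector on deployment %v: %v", ms.Name, d.Name, err)
			continue
		}
		// A nil or empty selector should match nothing, not everything.
		if selector.Empty() || !selector.Matches(labels.Set(ms.Labels)) {
			glog.V(4).Infof("Skipping MachineSet %v: labels do not match deployment %v's selector.", ms.Name, d.Name)
			continue
		}
		filtered = append(filtered, ms)
	}
	return filtered
}
```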

if reflect.DeepEqual(d.Spec.Selector, &everything) {
if d.Status.ObservedGeneration < d.Generation {
d.Status.ObservedGeneration = d.Generation
c.machineClient.ClusterV1alpha1().MachineDeployments(d.Namespace).UpdateStatus(d)
Contributor

Are we ignoring an error here? Maybe log if so...

Contributor Author

Return on err, added logging.

}

// Otherwise, it's an orphan. If anything changed, sync matching controllers
// to see if anyone wants to adopt it now.
Contributor

I think this means that if we remove the owner from the MachineSet we'll likely just re-adopt it, but I think that's OK.

Contributor Author

I think that is what we would want in the adoption case.

// We can't look up by UID, so look up by Name and then verify UID.
// Don't even try to look up by Name if it's the wrong Kind.
if controllerRef.Kind != controllerKind.Kind {
return nil
Contributor

I feel like some glog.Warningf in the 3 return nil cases would be helpful

Contributor Author

Agreed, added.
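
A sketch of the lookup-and-verify pattern under discussion, with a warning at each of the three early returns (the `getter` callback stands in for the MachineDeployment lister; names are illustrative):

```go
package machinedeployment

import (
	"github.com/golang/glog"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// resolveControllerRef returns the MachineDeployment a controller reference
// points at, or nil (with a warning) when the reference cannot be honored.
func resolveControllerRef(namespace string, ref *metav1.OwnerReference,
	getter func(namespace, name string) (metav1.Object, error)) metav1.Object {
	// Don't even try to look up by Name if it's the wrong Kind.
	if ref.Kind != "MachineDeployment" {
		glog.Warningf("Ignoring controller ref %v/%v: unexpected kind %q.", namespace, ref.Name, ref.Kind)
		return nil
	}
	d, err := getter(namespace, ref.Name)
	if err != nil {
		glog.Warningf("Could not look up MachineDeployment %v/%v: %v", namespace, ref.Name, err)
		return nil
	}
	// We can't look up by UID, so verify it after the Name lookup: an object
	// recreated with the same name must not be mistaken for the original owner.
	if d.GetUID() != ref.UID {
		glog.Warningf("MachineDeployment %v/%v has UID %v, want %v; treating the MachineSet as orphaned.",
			namespace, ref.Name, d.GetUID(), ref.UID)
		return nil
	}
	return d
}
```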

{
name: "scenario 1: 10 desired, oldMS at 10, 10 ready, 0 max unavailable => oldMS at 9, scale down by 1.",
deploymentReplicas: 10,
maxUnavailable: intstr.FromInt(0),
Contributor

Is this 0 unavailable & 0 max surge? I thought that was not allowed because we're stuck...

Contributor Author

Yes, that is an invalid configuration; it would normally have been safeguarded against at a higher level.
Fixed the test.

readyMachines: 10,
oldReplicas: 10,
scaleExpected: true,
expectedOldReplicas: 9,
Contributor

I think we violate maxUnavailable=0 as well here (?)

Contributor Author

Updated the test.
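
A worked check of why the original scenario was stuck (values from the test table above): with maxUnavailable=0 the deployment must keep all 10 machines available, so the old MachineSet cannot be scaled down, and with maxSurge also 0 it cannot scale up either.

```go
package main

import "fmt"

func main() {
	desired, maxUnavailable, ready := 10, 0, 10
	minAvailable := desired - maxUnavailable // 10
	// Scaling the old MachineSet from 10 to 9 would leave only 9 ready
	// machines, violating minAvailable; the rollout could make no progress.
	fmt.Println(ready-1 >= minAvailable) // false
}
```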

return dc.machineClient.ClusterV1alpha1().MachineSets(msCopy.ObjectMeta.Namespace).Update(msCopy)
}

// Should use the revision in existingNewMS's annotation, since it set by before
Contributor

I don't understand this comment

Contributor Author

Updated the comment.

We attempt to copy the revision annotation over to the deployment if it doesn't already have it.

}

needsUpdate := dutil.SetDeploymentRevision(d, newRevision)
if !alreadyExists && d.Spec.ProgressDeadlineSeconds != nil {
Contributor

Not sure why we're checking ProgressDeadlineSeconds here?

Contributor Author

No reason; removed.

case 1:
return allMSs[0]
default:
return nil
Contributor

Is this behaviour intended? It doesn't really match the comment, I think; allMSs[0] is the latest, I believe.

Contributor Author

The naming was off; it really wanted the one active (or latest) machine set. If there are multiple machine sets that are active, it needs to start scaling down some of the older machine sets first, to avoid having too many active machine sets.

Renamed the method and updated the comment.

@k4leung4 force-pushed the machinedeployment branch from 0b69744 to 69d3de3 on May 22, 2018 17:55
@k4leung4
Contributor Author

/assign @justinsb

@k4leung4
Contributor Author

/unassign @krousey

@k4leung4
Contributor Author

/hold

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label on May 29, 2018
@k4leung4 force-pushed the machinedeployment branch 2 times, most recently from 7f4027e to c461beb on May 30, 2018 01:25
Copies/borrows heavily from deployment controller.

Things not implemented in this PR:

deployment progress
orphan adoption
@k4leung4 force-pushed the machinedeployment branch from c461beb to 7b1f291 on May 30, 2018 15:51
@maisem

maisem commented May 30, 2018

/lgtm

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label on May 30, 2018
@k4leung4
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.) label on May 31, 2018
@k8s-ci-robot merged commit e929a2f into kubernetes-sigs:master on May 31, 2018
k4leung4 added a commit to k4leung4/cluster-api that referenced this pull request May 31, 2018
…roller

This is to pick up PR kubernetes-sigs#143, which allows for the creation of machine
deployments. Rolling update should now work.
k4leung4 added a commit to k4leung4/cluster-api that referenced this pull request May 31, 2018
…roller

This is to pick up PR kubernetes-sigs#143, which allows for the creation of machine
deployments. Rolling update should now work.
k4leung4 added a commit to k4leung4/cluster-api that referenced this pull request May 31, 2018
…roller

This is to pick up PR kubernetes-sigs#143, which allows for the creation of machine
deployments. Rolling update should now work.

Update apiserver for vsphere as that is what is being used for gce and
terraform.
k8s-ci-robot pushed a commit that referenced this pull request May 31, 2018
…roller (#260)

This is to pick up PR #143, which allows for the creation of machine
deployments. Rolling update should now work.

Update apiserver for vsphere as that is what is being used for gce and
terraform.
jayunit100 pushed a commit to jayunit100/cluster-api that referenced this pull request Jan 31, 2020
…roller (kubernetes-sigs#260)

This is to pick up PR kubernetes-sigs#143, which allows for the creation of machine
deployments. Rolling update should now work.

Update apiserver for vsphere as that is what is being used for gce and
terraform.
jayunit100 pushed a commit to jayunit100/cluster-api that referenced this pull request Jan 31, 2020
With the patch, clusterctl will be compiled, packaged into a container, and uploaded to GCR, then run as an initContainer along with the bootstrap_job container; the bootstrap_job container will share a volume with the initContainer so that clusterctl can be called from the bootstrap_job container.
enxebre added a commit to enxebre/cluster-api that referenced this pull request Sep 15, 2021
 This was first added here kubernetes-sigs#143 and it's never been used.
enxebre added a commit to enxebre/cluster-api that referenced this pull request Sep 15, 2021
 This was first added here kubernetes-sigs#143 and it's never been used.
enxebre added a commit to enxebre/cluster-api that referenced this pull request Sep 15, 2021
 This was first added here kubernetes-sigs#143 and it's never been used.
enxebre added a commit to enxebre/cluster-api that referenced this pull request Sep 15, 2021
 This was first added here kubernetes-sigs#143 and it's never been used.
sbueringer pushed a commit to sbueringer/cluster-api that referenced this pull request Sep 20, 2021
 This was first added here kubernetes-sigs#143 and it's never been used.
sbueringer pushed a commit to sbueringer/cluster-api that referenced this pull request Sep 21, 2021
 This was first added here kubernetes-sigs#143 and it's never been used.
Jont828 pushed a commit to Jont828/cluster-api that referenced this pull request Oct 26, 2021
 This was first added here kubernetes-sigs#143 and it's never been used.
Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)
8 participants