Support tainting all nodes needing update during rolling update #8021

Merged
merged 6 commits into kubernetes:master from the cordon branch
Jan 4, 2020

Conversation

johngmyers
Member

@johngmyers johngmyers commented Nov 27, 2019

Adds a per-instancegroup option to taint all nodes needing update near the start of a rolling update. The expectation is that this would only be enabled on instancegroups which have autoscaling enabled and which have workloads which can tolerate waiting for scale-up if needed.

Extends the InstanceGroup API to support configuration of the rolling update strategy. Extends the Cluster API to support per-cluster defaults for same. The configuration options are behind the new ConfigurableRollingUpdate feature flag, which defaults to off.
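
The layering described above (per-instancegroup value over per-cluster default) could be resolved roughly as follows. This is a minimal sketch, not the PR's `resolveSettings`: the helper names `resolveTaintAll` and `boolPtr` are hypothetical, and falling back to `false` when nothing is set is an assumption.

```go
package main

import "fmt"

// RollingUpdate mirrors the shape of the struct under review; only the one
// field discussed in this PR is included.
type RollingUpdate struct {
	TaintAllNeedUpdate *bool
}

// resolveTaintAll is a hypothetical helper: the instance group value wins,
// then the cluster default, then false (an assumed fallback).
func resolveTaintAll(clusterDefault, instanceGroup *RollingUpdate) bool {
	for _, s := range []*RollingUpdate{instanceGroup, clusterDefault} {
		if s != nil && s.TaintAllNeedUpdate != nil {
			return *s.TaintAllNeedUpdate
		}
	}
	return false
}

func boolPtr(b bool) *bool { return &b }

func main() {
	cluster := &RollingUpdate{TaintAllNeedUpdate: boolPtr(true)}
	ig := &RollingUpdate{TaintAllNeedUpdate: boolPtr(false)}
	fmt.Println(resolveTaintAll(cluster, ig))  // instance group overrides the cluster
	fmt.Println(resolveTaintAll(cluster, nil)) // falls back to the cluster default
	fmt.Println(resolveTaintAll(nil, nil))     // nothing set anywhere
}
```

Using `*bool` rather than `bool` is what lets "unset" be told apart from an explicit `false`, which matters for layered defaults like these.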

To limit the damage should the new instance specification produce instances that fail validation, the strategy treats the case where no existing instances have the current spec specially: it first cordons and updates a single instance, and waits until its replacement validates before tainting the rest.
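
The ordering described above can be sketched as a small planner. This is an illustration under stated assumptions, not the kops implementation: `planSteps`, its step strings, and the node names are all hypothetical.

```go
package main

import "fmt"

// planSteps is a hypothetical planner: it returns the order of actions for a
// group whose nodes in needUpdate are outdated. noneReady means no existing
// instance already runs the current spec.
func planSteps(needUpdate []string, noneReady bool) []string {
	var steps []string
	rest := needUpdate
	if noneReady && len(needUpdate) > 0 {
		// Try the new spec on one node before touching the others.
		first := needUpdate[0]
		steps = append(steps, "cordon "+first, "replace "+first, "validate cluster")
		rest = needUpdate[1:]
	}
	// Only now taint everything that is still outdated, then roll it.
	for _, n := range rest {
		steps = append(steps, "taint "+n)
	}
	for _, n := range rest {
		steps = append(steps, "replace "+n)
	}
	return steps
}

func main() {
	for _, s := range planSteps([]string{"node-a", "node-b", "node-c"}, true) {
		fmt.Println(s)
	}
}
```

If the first replacement never validates, the remaining nodes are still untainted and keep accepting pods.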

Fixes #7958

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 27, 2019
@k8s-ci-robot
Contributor

Hi @johngmyers. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 27, 2019
@johngmyers
Member Author

/area rolling-update

@johngmyers
Member Author

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. labels Nov 27, 2019
Member

@zetaab zetaab left a comment

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 27, 2019
@johngmyers
Member Author

Per comment in #7902, cordoning will cause the nodes to be taken out of the AWS ELB's group, so this will need to take a gentler approach to making the nodes unschedulable. I would still appreciate feedback on the approach to doing configuration of strategies.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2019
@johngmyers johngmyers changed the title Support cordoning all nodes needing update during rolling update Support tainting all nodes needing update during rolling update Nov 28, 2019
@johngmyers
Member Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 29, 2019
@johngmyers johngmyers force-pushed the cordon branch 2 times, most recently from 0a7882c to ceb9409 Compare December 6, 2019 06:22
@granular-ryanbonham granular-ryanbonham self-assigned this Dec 6, 2019
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 7, 2019
@johngmyers
Member Author

/retest

@johngmyers
Member Author

/test pull-kops-e2e-kubernetes-aws

@johngmyers
Member Author

/test pull-kops-verify-staticcheck

@johngmyers
Member Author

/test pull-kops-e2e-kubernetes-aws

@johngmyers
Member Author

Let me know if I should rebase and clean up the commit stream.


type RollingUpdate struct {
	// TaintAllNeedUpdate taints all nodes in the instance group that need update before draining any of them
	TaintAllNeedUpdate *bool `json:"taintAllNeedUpdate,omitempty"`
Member

Not 100% sure about this name - should this be "cordon"? Maybe preTaint or taintAllFirst?

Though I do see exactly how you came up with this name ... my suggestions don't specify that we only do instances that we are rolling. We could optimize the name for the common case, where we are rolling all the instances in the group - it's only when an update was interrupted that it won't be all of them (I think!)

What about if we state the intent, rather than the mechanism: avoidRepeatedPodScheduling or something along those lines?

OTOH, maybe this doesn't matter, because I would actually expect we treat unspecified as true once we are happy, because it seems the right strategy (I think!)

Member Author

We taint instead of cordon because cordoning can take the node out of rotation of the AWS ELB. Not a good thing to happen to all the nodes of an instancegroup.

I have a pending PR for allowing an external process to nominate nodes for updating, for example if they are older than site policy permits. Also, cluster autoscaler could add new nodes between spec update and rolling update. So interruption of rolling update isn't necessarily the only case where not all nodes in the instancegroup need updating.

It would be unfortunate if someone enabled this on an instancegroup that doesn't have autoscaling (or 100% surging), so I think having some indication to the admin that tainting will happen is desirable.
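
The difference between the two mechanisms can be sketched as follows. `Taint` and `Node` here are minimal local stand-ins for the corev1 types, so the example compiles without Kubernetes dependencies, and `cordon` and `addTaint` are hypothetical helpers rather than the PR's code.

```go
package main

import "fmt"

// Taint and Node are local stand-ins for the corev1 types, so the sketch
// compiles without Kubernetes dependencies.
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Name          string
	Unschedulable bool // the flag that cordoning sets
	Taints        []Taint
}

// cordon marks the node unschedulable.
func cordon(n *Node) { n.Unschedulable = true }

// addTaint appends a taint unless one with the same key and effect is
// already present, and never touches Unschedulable.
func addTaint(n *Node, key, effect string) {
	for _, t := range n.Taints {
		if t.Key == key && t.Effect == effect {
			return
		}
	}
	n.Taints = append(n.Taints, Taint{Key: key, Effect: effect})
}

func main() {
	n := &Node{Name: "node-a"}
	addTaint(n, "kops.k8s.io/rolling-update", "NoSchedule")
	addTaint(n, "kops.k8s.io/rolling-update", "NoSchedule") // no duplicate is added
	fmt.Println(n.Unschedulable, len(n.Taints))
}
```

Because the taint leaves the unschedulable flag alone, it avoids the load-balancer side effect attributed to cordoning above while still keeping new pods off the node.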

@@ -84,6 +84,8 @@ var (
	VSphereCloudProvider = New("VSphereCloudProvider", Bool(false))
	// SkipEtcdVersionCheck will bypass the check that etcd-manager is using a supported etcd version
	SkipEtcdVersionCheck = New("SkipEtcdVersionCheck", Bool(false))
	// ConfigurableRollingUpdate enables the RollingUpdate strategy configuration settings
	ConfigurableRollingUpdate = New("ConfigurableRollingUpdate", Bool(false))
Member

Do we need a featureflag, or can we rely on the idea that if users don't specify the field it won't get activated?

Member Author

I think it would be a good idea to have a featureflag until we have the whole set of features in. Then we might consider whether we want to consolidate some of the settings.

settings := resolveSettings(cluster, r.CloudGroup.InstanceGroup)

for uIdx, u := range update {
	if featureflag.ConfigurableRollingUpdate.Enabled() && *settings.TaintAllNeedUpdate {
Member

This looks like it's tainting each node before draining?

I understand why it's in the loop, but see the next comment. It might be easier to use PreferNoSchedule, and then we might not need to wait for the first successful new node.

I do wonder if we should undo the tainting on failure, but as these nodes no longer map to an InstanceGroup, we probably should keep them tainted, reflecting their true state.

Member Author

Currently kops doesn't uncordon a node on a failed drain. The only automated way to recover is to do another rolling update to complete the update.

In case the instance spec is reverted, we might add code to ensure that nodes with our taint are considered NotReady and thus updated on a subsequent rolling update. But this feature is likely to be only enabled on instancegroups with cluster autoscaler enabled, so that will eventually remove them for being underused.

}

if noneReady {
	// Wait until after one node is deleted and its replacement validates before the mass-cordoning
Member

Oh - this is clever :-)

It does get more complicated if we start rolling multiple nodes and/or temporarily creating more nodes ("surge upgrades"), but I guess in all cases we just wait for the first.

We can also "soft taint" the nodes ("preferred", not "forbidden"), so that the scheduler will still schedule pods back to the old nodes if something goes wrong. This also helps if, for example, something goes wrong concurrently with the nodes and we start losing them. I don't know whether we should rely on that or on this "wait for success before tainting" approach.

Member Author

Yes, for both MaxUnavailable and MaxSurge, I intend to also have this toe-dipping behavior. You really don't want a whole fleet of dead surged instances. Especially if it takes two or three tries to get a working spec.

if len(toTaint) > 0 {
	noun := "nodes"
	if len(toTaint) == 1 {
		noun = "node"
Member

It's normally fine not to worry about this, or just do node(s), but this does look better!


node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
	Key:    rollingUpdateTaintKey,
	Effect: corev1.TaintEffectNoSchedule,
Member

This is where we could use TaintEffectPreferNoSchedule, instead of the more complex "wait for one" heuristic.

Member Author

We could also do both...

The downside of a soft taint is that it would not cause cluster autoscaler to step in and create new instances.

The one thing I haven't figured out is how to get the overprovisioning pods evicted from the tainted nodes, so the pods can perform their intended function. Unfortunately, the requiredDuringSchedulingRequiredDuringExecution anti-affinity isn't implemented. Perhaps the rolling update code needs to search for and preemptively delete them.

return err
}

_, err = rollingUpdateData.K8sClient.CoreV1().Nodes().Patch(node.Name, types.StrategicMergePatchType, patchBytes)
Member

I was going to say it would be good to log the patch so we knew CreateTwoWayMergePatch was doing what we thought, but then I saw you have a test - even better 👍
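
For readers who want to see what such a patch body looks like, here is a stdlib-only sketch. It does not call `CreateTwoWayMergePatch`; it hand-builds a JSON body that sets `spec.taints`, and it assumes the patch carries the complete desired taint list. The `taint` type and `taintPatch` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taint is a local stand-in carrying only the fields used in this PR.
type taint struct {
	Key    string `json:"key"`
	Effect string `json:"effect"`
}

// taintPatch hand-builds a patch body that sets spec.taints. The PR itself
// uses CreateTwoWayMergePatch; this version is only an illustration.
func taintPatch(taints []taint) ([]byte, error) {
	body := map[string]interface{}{
		"spec": map[string]interface{}{"taints": taints},
	}
	return json.Marshal(body)
}

func main() {
	patch, err := taintPatch([]taint{{Key: "kops.k8s.io/rolling-update", Effect: "NoSchedule"}})
	if err != nil {
		panic(err)
	}
	// Logging the body is one way to confirm what would be sent to the API server.
	fmt.Println(string(patch))
}
```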

@justinsb
Member

This looks really great @johngmyers thanks so much & sorry about the delays!

A few comments that I've posted above:

  • field naming, as is always the case. I suggested naming to focus on the intent ("avoiding repeated pod bounces") and optimizing for the common case where all the machines in an instancegroup are rolling.
  • I wonder if we should just "soft" taint the old nodes with PreferNoSchedule, and then maybe we don't need the "wait for one" logic. Though I do like that!
  • I don't know if we need the featureflag, if users have to specify the field explicitly. But OTOH I also want to make it the default, and have the field unspecified turn on this behaviour - but I recognize we don't necessarily want to do that on day 1.

The name of the field is the only thing that's not really changeable, so that's the only real blocker in my view. (I know it's annoying because it's probably the least important thing technically). Just to throw out a strawman, I'd be happy with avoidPodRescheduling if that name makes sense for you? But really anything intent based...

@johngmyers
Member Author

I think the first thing to decide is whether it should be a hard or soft taint.

If it's soft, then there probably isn't a need for a setting at all. We could just do this always, including for masters. With hard, the setting is needed for instancegroups without either cluster autoscaling or 100% surging.

The disadvantage of soft is that it's less effective at avoiding repeated pod rescheduling: cluster autoscaler won't create new nodes until all the old ones are filled to capacity.

We could have the setting choose between hard/soft instead of hard/none.
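
A hard/soft setting of that kind might map onto taint effects as sketched below. `TaintMode`, its constants, and `taintEffect` are hypothetical and not part of the PR; treating an unset value as soft is an assumption. The effect strings are the standard Kubernetes values `NoSchedule` and `PreferNoSchedule`.

```go
package main

import "fmt"

// TaintMode is a hypothetical setting; the PR does not define it.
type TaintMode string

const (
	TaintModeHard TaintMode = "hard"
	TaintModeSoft TaintMode = "soft"
)

// taintEffect maps the mode to a Kubernetes taint effect string.
// An empty mode is treated as soft, which is an assumption of this sketch.
func taintEffect(mode TaintMode) (string, error) {
	switch mode {
	case TaintModeHard:
		return "NoSchedule", nil
	case TaintModeSoft, "":
		return "PreferNoSchedule", nil
	default:
		return "", fmt.Errorf("unknown taint mode %q", mode)
	}
}

func main() {
	for _, m := range []TaintMode{TaintModeHard, TaintModeSoft, ""} {
		effect, err := taintEffect(m)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q -> %s\n", m, effect)
	}
}
```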

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 31, 2019
@johngmyers
Member Author

I think it makes sense to proceed with soft tainting for now. There is the possibility of later adding a setting for choosing between soft and hard tainting, should we need it.

@johngmyers
Member Author

/test pull-kops-e2e-kubernetes-aws

Member

@geojaz geojaz left a comment

I don't see any dealbreakers here. If we can get aligned on our labels and naming, I'm +1.

@@ -34,6 +37,8 @@ import (
"k8s.io/kops/upup/pkg/fi"
)

const rollingUpdateTaintKey = "kops.k8s.io/rolling-update"
Member

Perhaps .../rolling-update-in-progress?

Member

I'd use something like /scheduled-for-update

Member

i'm on board with that

Member Author

Changed

}
}
}
if len(toTaint) > 0 {
Member

we haven't worried about this yet, (as per jsb's comment) but i'd be happy to see this pulled out to a util function somewhere. I'd be happy to use a helper like this as we build out GCE support.
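
A shared helper along those lines might look like the following. It is only a sketch: the name `pluralize` is hypothetical, and it handles regular "s" plurals only.

```go
package main

import "fmt"

// pluralize is a hypothetical helper; it formats a count with a noun and
// handles only nouns whose plural is formed by appending "s".
func pluralize(count int, singular string) string {
	if count == 1 {
		return fmt.Sprintf("%d %s", count, singular)
	}
	return fmt.Sprintf("%d %ss", count, singular)
}

func main() {
	fmt.Printf("Tainting %s in instance group\n", pluralize(1, "node"))
	fmt.Printf("Tainting %s in instance group\n", pluralize(3, "node"))
}
```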

@@ -221,6 +233,64 @@ func (r *RollingUpdateInstanceGroup) RollingUpdate(rollingUpdateData *RollingUpd
return nil
}

func (r *RollingUpdateInstanceGroup) taintAllNeedUpdate(update []*cloudinstances.CloudInstanceGroupMember, rollingUpdateData *RollingUpdateCluster) error {
Member

pretty much the only thing I don't like is the naming here. I would prefer something like taintOutdatedNodes, but i'm not a hard no, it just feels awkward to me.

@@ -137,6 +142,13 @@ func (r *RollingUpdateInstanceGroup) RollingUpdate(rollingUpdateData *RollingUpd
}
}

if !rollingUpdateData.CloudOnly {
	err = r.taintAllNeedUpdate(update, rollingUpdateData)
Member

Tip (that I am a recent convert to): Using if err := foo(); err != nil { return err } avoids problems with variable shadowing (though it really confuses the go static analysis tools, and it only works when it's a single error retval!)
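
A small, self-contained illustration of that pattern follows. The functions `step` and `run` are hypothetical and exist only to show the scoping.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical operation with a single error return value.
func step(fail bool) error {
	if fail {
		return errors.New("step failed")
	}
	return nil
}

// run scopes each err to its own if statement, so there is no outer err
// variable to reuse or shadow by accident.
func run(failSecond bool) error {
	if err := step(false); err != nil {
		return err
	}
	if err := step(failSecond); err != nil {
		return fmt.Errorf("second step: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(run(false))
	fmt.Println(run(true))
}
```

The form only fits calls that return a single error, since any other value declared in the initializer would also be scoped to the `if` statement.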

Member Author

Thanks, I'll try it out.

@justinsb
Member

justinsb commented Jan 4, 2020

Thanks @johngmyers - this is really a great improvement

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 4, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johngmyers, justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 4, 2020
@rifelpet
Member

rifelpet commented Jan 4, 2020

/retest

@k8s-ci-robot k8s-ci-robot merged commit 5ecf8d9 into kubernetes:master Jan 4, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Jan 4, 2020
@johngmyers johngmyers deleted the cordon branch January 4, 2020 17:07
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/rolling-update cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

KubeCon 2019 User Feedback: Reduce pods bounces durring rolling updates.
8 participants