Support tainting all nodes needing update during rolling update #8021
Conversation
Hi @johngmyers. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/area rolling-update
/kind feature
/ok-to-test
Per comment in #7902, cordoning will cause the nodes to be taken out of the AWS ELB's group, so this will need to take a gentler approach to making the nodes unschedulable. I would still appreciate feedback on the approach to configuring strategies.
/hold
/hold cancel
Force-pushed from 0a7882c to ceb9409
/retest
/test pull-kops-e2e-kubernetes-aws
/test pull-kops-verify-staticcheck
/test pull-kops-e2e-kubernetes-aws
Let me know if I should rebase and clean up the commit stream.
pkg/apis/kops/cluster.go (Outdated)

type RollingUpdate struct {
	// TaintAllNeedUpdate taints all nodes in the instance group that need update before draining any of them
	TaintAllNeedUpdate *bool `json:"taintAllNeedUpdate,omitempty"`
Not 100% sure about this name - should this be "cordon"? Maybe preTaint or taintAllFirst?
Though I do see exactly how you came up with this name ... my suggestions don't specify that we only do instances that we are rolling. We could optimize the name for the common case, where we are rolling all the instances in the group - it's only when an update was interrupted that it won't be all of them (I think!)
What about if we state the intent, rather than the mechanism: avoidRepeatedPodScheduling or something along those lines?
OTOH, maybe this doesn't matter, because I would actually expect we treat unspecified as true once we are happy, because it seems the right strategy (I think!)
We taint instead of cordon because cordoning can take the node out of rotation of the AWS ELB. Not a good thing to happen to all the nodes of an instancegroup.
I have a pending PR for allowing an external process to nominate nodes for updating, for example if they are older than site policy permits. Also, cluster autoscaler could add new nodes between spec update and rolling update. So interruption of rolling update isn't necessarily the only case where not all nodes in the instancegroup need updating.
It would be unfortunate if someone enabled this on an instancegroup that doesn't have autoscaling (or 100% surging), so I think having some indication to the admin that tainting will happen is desirable.
pkg/featureflag/featureflag.go (Outdated)

@@ -84,6 +84,8 @@ var (
	VSphereCloudProvider = New("VSphereCloudProvider", Bool(false))
	// SkipEtcdVersionCheck will bypass the check that etcd-manager is using a supported etcd version
	SkipEtcdVersionCheck = New("SkipEtcdVersionCheck", Bool(false))
	// ConfigurableRollingUpdate enables the RollingUpdate strategy configuration settings
	ConfigurableRollingUpdate = New("ConfigurableRollingUpdate", Bool(false))
Do we need a featureflag, or can we rely on the idea that if users don't specify the field it won't get activated?
I think it would be a good idea to have a featureflag until we have the whole set of features in. Then we might consider whether we want to consolidate some of the settings.
pkg/instancegroups/instancegroups.go (Outdated)

settings := resolveSettings(cluster, r.CloudGroup.InstanceGroup)

for uIdx, u := range update {
	if featureflag.ConfigurableRollingUpdate.Enabled() && *settings.TaintAllNeedUpdate {
This looks like it's tainting each node before draining?
I understand why it's in the loop, but see the next comment. It might be easier to use PreferNoSchedule, and then we might not need to wait for the first successful new node.
I do wonder if we should undo the tainting on failure, but as these nodes no longer map to an InstanceGroup, we probably should keep them tainted, reflecting their true state.
Currently kops doesn't uncordon a node on a failed drain. The only automated way to recover is to do another rolling update to complete the update.
In case the instance spec is reverted, we might add code to ensure that nodes with our taint are considered NotReady and thus updated on a subsequent rolling update. But this feature is likely to be only enabled on instancegroups with cluster autoscaler enabled, so that will eventually remove them for being underused.
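For readers following the diff, here is a minimal sketch of what a resolveSettings helper along these lines could do, assuming instance-group settings override cluster-wide defaults and that an unset TaintAllNeedUpdate defaults to false. The field names come from the API diff earlier in the thread, but the logic is an illustration rather than the PR's actual implementation:

```go
// Hypothetical sketch only; types are assumed to come from
// k8s.io/kops/pkg/apis/kops with the RollingUpdate fields added by this PR.
func resolveSettings(cluster *kops.Cluster, group *kops.InstanceGroup) kops.RollingUpdate {
	settings := kops.RollingUpdate{}
	if group.Spec.RollingUpdate != nil {
		// Per-instance-group settings take precedence.
		settings = *group.Spec.RollingUpdate
	}
	if settings.TaintAllNeedUpdate == nil && cluster.Spec.RollingUpdate != nil {
		// Fall back to the cluster-wide default.
		settings.TaintAllNeedUpdate = cluster.Spec.RollingUpdate.TaintAllNeedUpdate
	}
	if settings.TaintAllNeedUpdate == nil {
		// Default to the current behaviour (no mass tainting).
		f := false
		settings.TaintAllNeedUpdate = &f
	}
	return settings
}
```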
pkg/instancegroups/instancegroups.go (Outdated)

}

if noneReady {
	// Wait until after one node is deleted and its replacement validates before the mass-cordoning
Oh - this is clever :-)
It does get more complicated if we start rolling multiple nodes and/or temporarily creating more nodes ("surge upgrades"), but I guess in all cases we just wait for the first.
We can also "soft taint" the nodes ("preferred", not "forbidden"), so that the scheduler will still schedule pods back to the old nodes if something goes wrong. This also allows for e.g. if something goes wrong concurrently with the nodes and we start losing them. I don't know if we should rely on that or this "wait for success before tainting" approach?
Yes, for both MaxUnavailable and MaxSurge, I intend to also have this toe-dipping behavior. You really don't want a whole fleet of dead surged instances. Especially if it takes two or three tries to get a working spec.
if len(toTaint) > 0 {
	noun := "nodes"
	if len(toTaint) == 1 {
		noun = "node"
It's normally fine not to worry about this, or just do node(s), but this does look better!
pkg/instancegroups/instancegroups.go (Outdated)

node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
	Key:    rollingUpdateTaintKey,
	Effect: corev1.TaintEffectNoSchedule,
This is where we could use TaintEffectPreferNoSchedule, instead of the more complex "wait for one" heuristic.
We could also do both...
The downside of a soft taint is that it would not cause cluster autoscaler to step in and create new instances.
The one thing I haven't figured out is how to get the overprovisioning pods evicted from the tainted nodes, so the pods can perform their intended function. Unfortunately, the requiredDuringSchedulingRequiredDuringExecution anti-affinity isn't implemented. Perhaps the rolling update code needs to search for and preemptively delete them.
	return err
}

_, err = rollingUpdateData.K8sClient.CoreV1().Nodes().Patch(node.Name, types.StrategicMergePatchType, patchBytes)
I was going to say it would be good to log the patch so we knew CreateTwoWayMergePatch was doing what we thought, but then I saw you have a test - even better 👍
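As an illustration of the pattern under discussion (not the PR's exact code), here is a hedged sketch of computing a two-way strategic merge patch for the added taint and applying it. It assumes the pre-context client-go Patch signature shown in the diff above; applyTaint and its taintKey/effect parameters are hypothetical names:

```go
import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
	"k8s.io/client-go/kubernetes"
)

// applyTaint adds a taint to the node and patches only the changed fields.
func applyTaint(client kubernetes.Interface, node *corev1.Node, taintKey string, effect corev1.TaintEffect) error {
	oldData, err := json.Marshal(node)
	if err != nil {
		return err
	}

	// Append the rolling-update taint; effect could be NoSchedule (hard)
	// or PreferNoSchedule (soft), per the discussion above.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    taintKey,
		Effect: effect,
	})

	newData, err := json.Marshal(node)
	if err != nil {
		return err
	}

	// The two-way merge patch contains only the difference between the two
	// serializations, i.e. the updated taints list.
	patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, corev1.Node{})
	if err != nil {
		return err
	}

	_, err = client.CoreV1().Nodes().Patch(node.Name, types.StrategicMergePatchType, patchBytes)
	return err
}
```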
This looks really great @johngmyers thanks so much & sorry about the delays! A few comments that I've posted above:
The name of the field is the only thing that's not really changeable, so that's the only real blocker in my view. (I know it's annoying because it's probably the least important thing technically). Just to throw out a strawman, I'd be happy with
I think the first thing to decide is whether it should be a hard or soft taint. If it's soft, then there probably isn't a need for a setting at all. We could just do this always, including for masters. With hard, the setting is needed for instancegroups without either cluster autoscaling or 100% surging. The disadvantage of soft is that it's less effective at avoiding repeated pod rescheduling. Cluster autoscaler won't create new nodes until all the old ones are filled to capacity. We could have the setting choose between hard/soft instead of hard/none.
I think it makes sense to proceed with soft tainting for now. There is the possibility of later adding a setting for choosing between soft and hard tainting, should we need it.
/test pull-kops-e2e-kubernetes-aws
I don't see any dealbreakers here. If we can get aligned on our labels and naming, I'm +1.
pkg/instancegroups/instancegroups.go (Outdated)

@@ -34,6 +37,8 @@ import (
	"k8s.io/kops/upup/pkg/fi"
)

const rollingUpdateTaintKey = "kops.k8s.io/rolling-update"
Perhaps .../rolling-update-in-progress?
I'd use something like /scheduled-for-update
I'm on board with that.
Changed
		}
	}
}
if len(toTaint) > 0 {
We haven't worried about this yet (as per jsb's comment), but I'd be happy to see this pulled out to a util function somewhere. I'd be happy to use a helper like this as we build out GCE support.
@@ -221,6 +233,64 @@ func (r *RollingUpdateInstanceGroup) RollingUpdate(rollingUpdateData *RollingUpd
	return nil
}

func (r *RollingUpdateInstanceGroup) taintAllNeedUpdate(update []*cloudinstances.CloudInstanceGroupMember, rollingUpdateData *RollingUpdateCluster) error {
Pretty much the only thing I don't like is the naming here. I would prefer something like taintOutdatedNodes, but I'm not a hard no, it just feels awkward to me.
@@ -137,6 +142,13 @@ func (r *RollingUpdateInstanceGroup) RollingUpdate(rollingUpdateData *RollingUpd
	}
}

if !rollingUpdateData.CloudOnly {
	err = r.taintAllNeedUpdate(update, rollingUpdateData)
Tip (that I am a recent convert to): using if err := foo(); err != nil { return err } avoids problems with variable shadowing (though it really confuses the go static analysis tools, and it only works when it's a single error retval!)
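A minimal illustration of that idiom; doSomething and doSomethingElse are hypothetical helpers, not code from this PR:

```go
func doSomething() error               { return nil }
func doSomethingElse() (string, error) { return "", nil }

func example() error {
	// err is scoped to the if statement, so it cannot shadow (or be
	// shadowed by) an err variable in the enclosing function.
	if err := doSomething(); err != nil {
		return err
	}

	// The inline form only works when the error is the only value being
	// assigned; with a second return value a regular assignment is needed.
	result, err := doSomethingElse()
	if err != nil {
		return err
	}
	_ = result
	return nil
}
```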
Thanks, I'll try it out.
Thanks @johngmyers - this is really a great improvement
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johngmyers, justinsb

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
Adds a per-instancegroup option to taint all nodes needing update near the start of a rolling update. The expectation is that this would only be enabled on instancegroups which have autoscaling enabled and which have workloads which can tolerate waiting for scale-up if needed.
Extends the InstanceGroup API to support configuration of the rolling update strategy. Extends the Cluster API to support per-cluster defaults for same. The configuration options are behind the new ConfigurableRollingUpdate feature flag, which defaults to off.
In order to limit the damage should the new instance specification produce instances which fail validation, in the case where there are no existing instances with the current spec the strategy will first cordon and update a single instance, waiting until its replacement validates before tainting the rest.
Fixes #7958
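For reference, a rough sketch of the flow described above; this is not the kops implementation, and replaceAndValidate and taintNodes are hypothetical placeholders for the existing per-node update and tainting steps:

```go
// Toe-dipping flow: when no instance already matches the new spec, replace
// and validate one instance before tainting the remaining outdated nodes;
// otherwise taint all outdated nodes up front, then drain and replace them.
func rollInstanceGroup(update []*cloudinstances.CloudInstanceGroupMember, noneReady bool) error {
	if !noneReady {
		if err := taintNodes(update); err != nil {
			return err
		}
	}
	for i, member := range update {
		if err := replaceAndValidate(member); err != nil {
			return err
		}
		if noneReady && i == 0 {
			// The first replacement validated, so mass-tainting is now safe.
			if err := taintNodes(update[i+1:]); err != nil {
				return err
			}
		}
	}
	return nil
}
```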