Avoid waiting on validation during rolling update for inapplicable instance groups #10065
Conversation
Welcome @bharath-123!
Hi @bharath-123. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Seems like some static checks and format tests are failing. I should have run gofmt and a bunch of other static checks before raising the PR. Will do so now.
You'll need to run make gofmt and fix that duplicate import in order to fix those failed tests.
A useful unit test for rolling update would be to mock in a validation failure for a node group that is not selected for update and verify that the rolling update completes regardless.
The validation unit tests should be updated to assert that the instanceGroup field is filled in appropriately.
In collectPodFailures, failures for pods of priority system-node-critical should be marked as being specific to the pod's node's group. These are typically things like CNI DaemonSets. Pods that are system-cluster-critical, on the other hand, are cluster-level failures.
pkg/instancegroups/instancegroups.go
Outdated
// TODO: Should we check if we have enough time left before the deadline?
if shouldWaitBeforeRetry(result.Failures, group) {
	time.Sleep(c.ValidateTickDuration)
}
There's a logic error here: when shouldWaitBeforeRetry returns false, instead of letting the validation succeed and proceeding with updating the next node, it goes into a fast spinloop.
When all failures are inapplicable it should go through the code block that increments successCount.
One question: even if the failures are inapplicable and we skip the wait period, we definitely want to log the failures, right? I think we would need to change the log messages here if so. If the validation fails and the failures are inapplicable, showing the failure log messages alongside a "cluster validated" message would be very unintuitive for sysadmins.
I am planning to copy this block of code and add it in the else branch for when shouldWaitBeforeRetry returns false. I will change the log messages to indicate that there were failures but they were ignored because they are in a different instance group.
successCount++
if successCount >= validateCount {
	klog.Info("Cluster validated.")
	return nil
}
klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
time.Sleep(c.ValidateSuccessDuration)
continue
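For illustration, a rough sketch of how the loop body could combine these pieces: skip the wait and fall through to the success path when none of the failures apply to the group being updated, while still logging the ignored failures. It assumes a hasFailureRelevantToGroup helper (the renamed form suggested later in this review) and the surrounding retry loop in validateClusterWithTimeout; it is a sketch of the idea, not the merged implementation:

// Inside validateClusterWithTimeout's retry loop (sketch only):
result, err := c.ClusterValidator.Validate()
if err == nil && !hasFailureRelevantToGroup(result.Failures, group) {
	for _, failure := range result.Failures {
		// Failures exist, but none apply to the group being updated (or cluster-wide),
		// so log them as ignored instead of silently reporting a clean validation.
		klog.Infof("Ignoring failure in unrelated instance group: %s", failure.Message)
	}
	successCount++
	if successCount >= validateCount {
		klog.Info("Cluster validated.")
		return nil
	}
	klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
	time.Sleep(c.ValidateSuccessDuration)
	continue
}

// Failures relevant to this group (or a validation error): reset the success
// count and wait before retrying, so the loop does not spin.
successCount = 0
time.Sleep(c.ValidateTickDuration)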
@@ -154,22 +155,24 @@ func Test_ValidateCloudGroupMissing(t *testing.T) {
	Kind:          "InstanceGroup",
	Name:          "node-1",
	Message:       "InstanceGroup \"node-1\" is missing from the cloud provider",
	InstanceGroup: instanceGroup,
Nit: you could have made a smaller change by using instanceGroups[0].
/retest
If the |
@johngmyers I was not able to figure out how to infer the InstanceGroup for a pod in collectPodFailures. The pod has a NodeName field in its podspec, but I was not really sure how we could infer the InstanceGroup for a pod from that.
I will follow this up with the unit test and the code changes that you have suggested. Thank you.
It would be better to edit the commit stream so that the formatting and staticcheck issues were fixed in their original commits. I like the separation between the first and second commits. The third commit should be combined with the second as it is part of the same logical change.
Shouldn't be difficult if we can figure out how to infer the instanceGroup from a pod given its structure.
Yes, I agree with this. I will anyway be doing a whole bunch of changes now, so I will revise the commit stream.
The rolling update code should probably remove the inapplicable failures from what it logs. |
pkg/validation/validate_cluster.go
Outdated
@@ -25,6 +25,7 @@ import (

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/pager"
	kopsapi "k8s.io/kops/pkg/apis/kops"
This should already be imported as kops below, I believe
Yes @justinsb, will be doing this.
Makes sense. I am working on a different implementation of this feature now.
Force-pushed from 9fe80ab to b9927ac.
@johngmyers @justinsb @hakman @olemarkus @zetaab, do have a look at this. This is a complete implementation, with unit tests added for rolling update. Do review this.
There are a couple of unused imports and unrelated comments which I had added when I was trying a different implementation. I'll remove them now.
Force-pushed from b9927ac to d4e762d.
Gentle ping @johngmyers @justinsb @hakman
@bharath-123 Feel free to remove |
Done.
I don't see where in the code failures specific to a master IG cause rolling update to be blocked from proceeding.
From comments it seems you intended to put this logic in the validation code. I think it might be simpler to put it in the rolling update code that examines the returned list of failures.
pkg/instancegroups/instancegroups.go
Outdated
@@ -445,7 +446,7 @@ func (c *RollingUpdateCluster) validateClusterWithTimeout(validateCount int) error
 	for {
 		// Note that we validate at least once before checking the timeout, in case the cluster is healthy with a short timeout
 		result, err := c.ClusterValidator.Validate()
-		if err == nil && len(result.Failures) == 0 {
+		if err == nil && (len(result.Failures) == 0 || isNotRelatedInstanceGroupError(result.Failures, group)) {
The len(result.Failures) == 0 check is redundant.
yes. Will remove it.
pkg/instancegroups/instancegroups.go
Outdated
		// TODO: Should we check if we have enough time left before the deadline?
		time.Sleep(c.ValidateTickDuration)
	}

	return fmt.Errorf("cluster did not validate within a duration of %q", c.ValidationTimeout)
}

func isNotRelatedInstanceGroupError(failures []*validation.ValidationError, group *cloudinstances.CloudInstanceGroup) bool {
The sense of the negation is confusing. They're also "failures"; "errors" are something else. I would recommend inverting the sense of the return value, calling this something like hasFailureRelevantToGroup()
yes. makes sense. Will do that.
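A minimal sketch of what the inverted helper might look like, assuming the ValidationError.InstanceGroup field introduced in this PR and treating a nil InstanceGroup (cluster-wide failures, including those from master instance groups) as always relevant; the name and exact checks are illustrative rather than the merged code:

// hasFailureRelevantToGroup reports whether any validation failure should block
// the rolling update of the given instance group.
func hasFailureRelevantToGroup(failures []*validation.ValidationError, group *cloudinstances.CloudInstanceGroup) bool {
	for _, failure := range failures {
		if failure.InstanceGroup == nil {
			// Cluster-wide failure, or one we could not attribute to a group.
			return true
		}
		if failure.InstanceGroup.IsMaster() {
			// Failures in master instance groups are treated as cluster-wide.
			return true
		}
		if failure.InstanceGroup.Name == group.InstanceGroup.Name {
			// A failure in the group currently being updated.
			return true
		}
	}
	return false
}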
@@ -105,6 +105,30 @@ func (*erroringClusterValidator) Validate() (*validation.ValidationCluster, error) {
	return nil, errors.New("testing validation error")
}

// simulates failures in a specific node group in the map of instance groups
Suggested change:
// instanceGroupNodeSpecificErrorClusterValidator simulates failures in a specific node group in the map of instance groups.
pkg/validation/validate_cluster.go
Outdated
// optional field to indicate which instance group this validation error is coming from
// if InstanceGroup is nil, then the validation error is either a cluster wide error
// or an error in the master instancegroup which indicates a cluster wide error
Suggested change:
// InstanceGroup is an optional field to indicate which instance group this validation error is coming from.
// If nil, then the validation error is either a cluster wide error
// or an error in the master instancegroup which indicates a cluster wide error.
Acked these comment suggestions. They are indeed more helpful.
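For context, a sketch of roughly what the struct looks like with the new field; the field set is reconstructed from the snippets in this review (kopsapi referring to k8s.io/kops/pkg/apis/kops), not copied from the merged code:

type ValidationError struct {
	Kind    string
	Name    string
	Message string
	// InstanceGroup is an optional field to indicate which instance group this
	// validation error is coming from. If nil, the error is either cluster-wide
	// or from the master instance group, which also indicates a cluster-wide error.
	InstanceGroup *kopsapi.InstanceGroup
}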
pkg/validation/validate_cluster.go
Outdated
@@ -238,6 +243,9 @@ func (v *ValidationCluster) collectPodFailures(ctx context.Context, client kubernetes.Interface,
	if pod.Status.Phase == v1.PodSucceeded {
		return nil
	}

	// pod validationErrors usually do not have instanceGroup field as all pod errors caught
	// are system critical pods.
This isn't true: some pod failures are for node-critical pods, not cluster-critical pods.
I inferred this from line 240 containing the following code:
if priority != "system-cluster-critical" && priority != "system-node-critical" {
	return nil
}
But I think I misread system-node-critical. I need to brainstorm on how to get the InstanceGroup for the pods, which doesn't seem to be straightforward.
As mentioned previously, this can be deferred to a later PR.
Alright. So we can leave the InstanceGroup field as nil for validation errors of kind Pod.
}

func (igErrorValidator *instanceGroupNodeSpecificErrorClusterValidator) Validate() (*validation.ValidationCluster, error) {
	instanceGroup := igErrorValidator.Groups[igErrorValidator.NodeGroup].InstanceGroup
It would be simpler if the instanceGroupNodeSpecificErrorClusterValidator struct only contained an InstanceGroup field.
alright. Will do that. It makes sense.
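A sketch of the simplified mock the reviewer is suggesting, where the validator holds only the instance group whose failure it should simulate. The struct name, field names, and the ValidationCluster/ValidationError shapes are taken from the snippets above and are illustrative, not the merged test code:

type instanceGroupErrorClusterValidator struct {
	InstanceGroup *kopsapi.InstanceGroup
}

func (v *instanceGroupErrorClusterValidator) Validate() (*validation.ValidationCluster, error) {
	// Report a single failure attributed to the configured instance group, so the
	// rolling update tests can check that failures in unrelated node groups are ignored.
	return &validation.ValidationCluster{
		Failures: []*validation.ValidationError{
			{
				Kind:          "testing",
				Name:          "testingfailure",
				Message:       "testing failure",
				InstanceGroup: v.InstanceGroup,
			},
		},
	}, nil
}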
func (igErrorValidator *instanceGroupNodeSpecificErrorClusterValidator) Validate() (*validation.ValidationCluster, error) {
	instanceGroup := igErrorValidator.Groups[igErrorValidator.NodeGroup].InstanceGroup
	if instanceGroup.IsMaster() || instanceGroup.IsBastion() {
		instanceGroup = nil
This is testing the mock, not the code under test.
Would it be better if we included the validation logic in the Validate mock? I thought just mocking the failures should have been enough to test this, since we are mostly concerned with the failures returned by Validate() and not how it returns them. Am I understanding something wrongly?
If it was the rolling update code that was responsible for ensuring failures for IGs of role master still blocked rolling update, then testing that in the rolling update code would make sense. In the current design, where it is the validation code that is responsible for that logic, that logic should be tested by the validation unit tests, not the rolling update unit tests.
Fair enough. This makes sense. I'll push all the logic that determines whether we should wait on validation to the rolling update code rather than the validation code. It would make things much easier and simpler.
In the ValidationError struct, if the InstanceGroup field is nil then the failure is a cluster-wide failure for which rolling update should indeed be blocked. For all validation failures in the master IG, the InstanceGroup field is nil. isNotRelatedInstanceGroupError (which I will rename) returns false if the InstanceGroup field is nil, which causes rolling update to be blocked. It would definitely be cleaner if we could have the InstanceGroup field filled in for all failures. I mainly did this because it would be a bit difficult to identify the groups for the pods. I'll think more about this and get back.
I did not see code for ensuring the |
Will be pushing all the logic to the rolling update code now.
@johngmyers working on refining this. Just to confirm, for Pods and ComponentStatuses it's not straightforward to get the InstanceGroup. This will anyways be deferred to a future PR. For these cases, in the rolling update code we will be considering cases where |
Force-pushed from d4e762d to 4016b36.
The InstanceGroup field in the ValidationError struct is an optional field meant to indicate the InstanceGroup which reported the failure. This field either holds a pointer to the instance group which caused the validation error or is nil, which indicates that we were unable to determine the instance group to which this failure should be attributed. This field is mainly used to identify whether a failure is worth waiting for when validating a particular instance group.
This commit fixes the unit tests for validate_cluster to reflect the addition of the new InstanceGroup field in the ValidationError struct.
When unrelated instance groups produce validation failures, the rolling update of the instance group being updated is forced to wait before continuing. This can be avoided, as failures in different node instance groups usually don't affect the instance group being updated in any way.
The tests create a cluster with two node instance groups, one master instance group, and one bastion instance group. Only one node instance group requires rolling update. instanceGroupNodeSpecificErrorClusterValidator mocks a validation failure for a given node group. Rolling update should not fail if the cluster validator reports a failure in an unrelated instance group.
Force-pushed from 4016b36 to 1e18a5d.
Gentle ping on this @johngmyers. Thank you for your help and patience on this PR.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bharath-123, johngmyers The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
When an error occurs during cluster validation during a rolling update, the validation phase waits for a certain period of time before continuing.
We can avoid this wait for errors that are inapplicable to the instance group whose rolling update is occurring. For example, when a node in a different instance group is terminated, validation reports an error about that instance group's target size being smaller than expected. For such errors, we do not want the rolling update of the instance group to wait on failures in inapplicable instance groups.
Fixes #10009
Really appreciate the review comments and discussions. Working on this PR so far has been a great experience for me.