Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid waiting on validation during rolling update for inapplicable instance groups #10065

Conversation

bharath-123
Copy link
Contributor

@bharath-123 bharath-123 commented Oct 17, 2020

When an error occurs during cluster validation during a rolling update, the validation phase waits for a certain period of time before continuing.

We can avoid this wait for certain errors inapplicable to an instance group whose rolling update occurring. eg: When a node in a different instance group is terminated leading to a error regarding instance group target size being lesser than expected. For such errors we do not want rolling update on the instance group to wait for errors on inapplicable instance groups.

Fixes #10009

Really appreciate the review comments and discussions. Working on this PR so far has been a great experience for me.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 17, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @bharath-123!

It looks like this is your first PR to kubernetes/kops 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kops has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @bharath-123. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 17, 2020
@k8s-ci-robot k8s-ci-robot requested review from hakman and zetaab October 17, 2020 19:17
@k8s-ci-robot k8s-ci-robot added area/rolling-update size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 17, 2020
@hakman
Copy link
Member

hakman commented Oct 18, 2020

/cc @olemarkus @johngmyers

@johngmyers
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 18, 2020
@johngmyers johngmyers added the hacktoberfest-accepted Accepted contribution for Hacktoberfest label Oct 18, 2020
@bharath-123
Copy link
Contributor Author

Seems like some static checks and format tests are failing. Should have run gofmt and a bunch of other static checks before raising the PR. Will do so now.

Copy link
Member

@johngmyers johngmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to run make gofmt and fix that duplicate import in order to fix those failed tests.

A useful unit test for rolling update would be to mock in a validation failure for a nodes group that is not selected for update and verify that the rolling update completes regardless.

The validation unit tests should be updated to assert that the instanceGroup field is filled in appropriately.

In collectPodFailures, failures for pods of priority system-node-critical should be marked as being specific to the pod's node's group. These are typically things like CNI DaemonSets. Pods that are system-cluster-critical, on the other hand, are cluster-level failures.

// TODO: Should we check if we have enough time left before the deadline?
if shouldWaitBeforeRetry(result.Failures, group) {
time.Sleep(c.ValidateTickDuration)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a logic error here: when shouldWaitBeforeRetry fails, instead of letting the validation succeed and proceeding with updating the next node it instead goes into a fast spinloop.

When all failures are inapplicable it should go through the code block that increments successCount.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, even if the failures are inapplicable and we skip the wait period. We definitely want to log the failures right? I think we would need to change the log messages here if I am right. If the validation fails and the failures are inapplicable, showing the failure log messages and a cluster validated messages would be very unintuitive for the sys admins.
I am planning to copy this block of code and add in the else stmt if shouldWaitBeforeRetry fails. I will change up the log messages to indicate that there were failures but they were ignored because they are in a different instance group.

			successCount++
			if successCount >= validateCount {
				klog.Info("Cluster validated.")
				return nil
			}
			klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
			time.Sleep(c.ValidateSuccessDuration)
			continue

@@ -154,22 +155,24 @@ func Test_ValidateCloudGroupMissing(t *testing.T) {
Kind: "InstanceGroup",
Name: "node-1",
Message: "InstanceGroup \"node-1\" is missing from the cloud provider",
InstanceGroup: instanceGroup,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you could have made a smaller change by using instanceGroups[0].

@bharath-123
Copy link
Contributor Author

/retest

@johngmyers
Copy link
Member

If the system-node-critical feature is too complicated, it can be done as a follow-up PR.

@bharath-123
Copy link
Contributor Author

@johngmyers I was not able to figure out how to infer the InstanceGroup for a pod in collectPodFailures? The pod has a NodeName field in it's podspec. But I was not really sure how we could infer the InstanceGroup for a pod from that?

@bharath-123
Copy link
Contributor Author

I will follow this up with the unit test and the code changes that you have suggested. Thank you.

@johngmyers
Copy link
Member

It would be better to edit the commit stream so that the formatting and staticcheck issues were fixed in their original commits.

I like the separation between the first and second commits. The third commit should be combined with the second as it is part of the same logical change.

@bharath-123
Copy link
Contributor Author

If the system-node-critical feature is too complicated, it can be done as a follow-up PR.

Shouldn't be difficult if we can figure out how to infer the instanceGroup from a pod given its structure.

@bharath-123
Copy link
Contributor Author

It would be better to edit the commit stream so that the formatting and staticcheck issues were fixed in their original commits.

I like the separation between the first and second commits. The third commit should be combined with the second as it is part of the same logical change.

yes I agree with this. I will anyways be doing a whole bunch of changes now. So will revise the commit stream.

@johngmyers
Copy link
Member

validateNodes could construct a map of nodes to groups. collectPodFailures already constructs a map of IP address to node, so it could then use that map of nodes to groups to go from pod to group.

@johngmyers
Copy link
Member

The rolling update code should probably remove the inapplicable failures from what it logs.

@@ -25,6 +25,7 @@ import (

"k8s.io/apimachinery/pkg/runtime"
"k8s.io/client-go/tools/pager"
kopsapi "k8s.io/kops/pkg/apis/kops"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should already be imported as kops below, I believe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes @justinsb will be doing this.

@bharath-123
Copy link
Contributor Author

The rolling update code should probably remove the inapplicable failures from what it logs.

Makes sense. I am working on a different implementation of this feature now.

@bharath-123 bharath-123 force-pushed the feature/instancegroup-specific-validation branch 2 times, most recently from 9fe80ab to b9927ac Compare October 25, 2020 11:56
@bharath-123
Copy link
Contributor Author

@johngmyers @justinsb @hakman @olemarkus @zetaab do have a look at this. This is a complete implementation with unit tests added for rolling update. Do review this.

@bharath-123
Copy link
Contributor Author

There are a couple of unused imports and unrelated comments which I had added when I was trying a different implementation. I ll remove them now.

@bharath-123 bharath-123 force-pushed the feature/instancegroup-specific-validation branch from b9927ac to d4e762d Compare October 25, 2020 12:25
@bharath-123
Copy link
Contributor Author

gentle ping @johngmyers @justinsb @hakman

@rifelpet
Copy link
Member

@bharath-123 Feel free to remove [WIP] from the PR title if this is ready for review.

@bharath-123 bharath-123 changed the title [WIP] Avoid waiting on validation during rolling update for inapplicable instance groups Avoid waiting on validation during rolling update for inapplicable instance groups Oct 26, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2020
@bharath-123
Copy link
Contributor Author

@bharath-123 Feel free to remove [WIP] from the PR title if this is ready for review.

done

Copy link
Member

@johngmyers johngmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where in the code failures specific to a master IG cause rolling update to be blocked from proceeding.

From comments it seems you intended to put this logic in the validation code. I think it might be simpler to put it in the rolling update code that examines the returned list of failures.

@@ -445,7 +446,7 @@ func (c *RollingUpdateCluster) validateClusterWithTimeout(validateCount int) err
for {
// Note that we validate at least once before checking the timeout, in case the cluster is healthy with a short timeout
result, err := c.ClusterValidator.Validate()
if err == nil && len(result.Failures) == 0 {
if err == nil && (len(result.Failures) == 0 || isNotRelatedInstanceGroupError(result.Failures, group)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The len(result.Failures) == 0 is redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. Will remove it.

// TODO: Should we check if we have enough time left before the deadline?
time.Sleep(c.ValidateTickDuration)
}

return fmt.Errorf("cluster did not validate within a duration of %q", c.ValidationTimeout)
}

func isNotRelatedInstanceGroupError(failures []*validation.ValidationError, group *cloudinstances.CloudInstanceGroup) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sense of the negation is confusing. They're also "failures"; "errors" are something else. I would recommend inverting the sense of the return value, calling this something like hasFailureRelevantToGroup()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. makes sense. Will do that.

@@ -105,6 +105,30 @@ func (*erroringClusterValidator) Validate() (*validation.ValidationCluster, erro
return nil, errors.New("testing validation error")
}

// simulates failures in a specific node group in the map of instance groups
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// simulates failures in a specific node group in the map of instance groups
// instanceGroupNodeSpecificErrorClusterValidator simulates failures in a specific node group in the map of instance groups.

Comment on lines 52 to 54
// optional field to indicate which instance group this validation error is coming from
// if InstanceGroup is nil, then the validation error is either a cluster wide error
// or an error in the master instancegroup which indicates a cluster wide error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// optional field to indicate which instance group this validation error is coming from
// if InstanceGroup is nil, then the validation error is either a cluster wide error
// or an error in the master instancegroup which indicates a cluster wide error
// InstanceGroup is an optional field to indicate which instance group this validation error is coming from.
// If nil, then the validation error is either a cluster wide error
// or an error in the master instancegroup which indicates a cluster wide error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acked these comment suggestions. They are indeed more helpful.

@@ -238,6 +243,9 @@ func (v *ValidationCluster) collectPodFailures(ctx context.Context, client kuber
if pod.Status.Phase == v1.PodSucceeded {
return nil
}

// pod validationErrors usually do not have instanceGroup field as all pod errors caught
// are system critical pods.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true: some pod errors are node critical pods, not system critical pods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I inferred this from line 240 containing the following code:

		if priority != "system-cluster-critical" && priority != "system-node-critical" {
			return nil
		}

But I think I wrongly read system-node-critical. Need to brainstorm on how to get the InstanceGroup for the pods which doesn't seem to be straightforward.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned previously, this can be deferred to a later PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. So we can leave the InstanceGroup field as nil for validation error for kind Pods.

}

func (igErrorValidator *instanceGroupNodeSpecificErrorClusterValidator) Validate() (*validation.ValidationCluster, error) {
instanceGroup := igErrorValidator.Groups[igErrorValidator.NodeGroup].InstanceGroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be simpler if the instanceGroupNodeSpecificErrorClusterValidator struct only contained an InstanceGroup field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright. Will do that. It makes sense.

func (igErrorValidator *instanceGroupNodeSpecificErrorClusterValidator) Validate() (*validation.ValidationCluster, error) {
instanceGroup := igErrorValidator.Groups[igErrorValidator.NodeGroup].InstanceGroup
if instanceGroup.IsMaster() || instanceGroup.IsBastion() {
instanceGroup = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is testing the mock, not the code under test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better if we include the Validate logic in the Validate mock? I thought just mocking the failures should have been enough to test this. Since we are mostly concerned about the failures return by Validate() and not how it returns it. Am I understanding something wrongly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it was the rolling update code that was responsible for ensuring failures for IGs of role master still blocked rolling update, then testing that in the rolling updates code would make sense. In the current design, where it is the validation code that is responsible for that logic, that logic should be tested by the validation unit tests, not the rolling update unit tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. This makes sense. I ll push all logic to determine whether we should wait or not for validation to the rolling update code rather than the validation code. It would make things much easier and simpler

@bharath-123
Copy link
Contributor Author

I don't see where in the code failures specific to a master IG cause rolling update to be blocked from proceeding.

From comments it seems you intended to put this logic in the validation code. I think it might be simpler to put it in the rolling update code that examines the returned list of failures.

I don't see where in the code failures specific to a master IG cause rolling update to be blocked from proceeding.

From comments it seems you intended to put this logic in the validation code. I think it might be simpler to put it in the rolling update code that examines the returned list of failures.

In the ValidationError struct, If the InstanceGroup field is nil then this failure is a cluster wide failure for which rolling update should indeed be blocked. For all validation failures in master IG, the InstanceGroup field is nil. isNotRelatedInstanceGroupError ( Which I will change the name of ) returns false if InstanceGroup field is nil which causes rolling update to be blocked.

It would definitely be cleaner if we can have InstanceGroup field filled up for all failures. I mainly did this because it would be a bit difficult to identify the groups for the pods. I ll think more about this and get back.

@johngmyers
Copy link
Member

For all validation failures in master IG, the InstanceGroup field is nil.

I did not see code for ensuring the InstanceGroup field is nil for failures in IGs of role master.

@bharath-123
Copy link
Contributor Author

For all validation failures in master IG, the InstanceGroup field is nil.

I did not see code for ensuring the InstanceGroup field is nil for failures in IGs of role master.

Will be pushing all the logic to rolling update code now.

@bharath-123
Copy link
Contributor Author

@johngmyers working on refining this. Just to confirm, for Pods and ComponentStatuses it's not straightforward to get the InstanceGroup. This will anyways be deferred to a future PR.

For these cases, in the rolling update code we will be considering cases where InstanceGroup == nil to be a relevant InstanceGroup failure. Would I need to assert these cases in the validate_cluster tests? In the rolling update test, I ll include a test to fail rolling update if a validation error is returned if InstanceGroup is nil.

@bharath-123 bharath-123 force-pushed the feature/instancegroup-specific-validation branch from d4e762d to 4016b36 Compare October 31, 2020 13:34
The InstanceGroup field in ValidationError struct is an optional field meant
to indicate the InstanceGroup which has reported that failure. This field either
holds a pointer to the instance group which caused the validation error or can be
nil which indicates that we were unable to determine the instance group to which
this failure should be attributed to.

This field is mainly used to identify whether a failure is worth waiting for
when validating a particular instance group.
This commit fixes the unit tests for validate_cluster to reflect the addition of the new
InstanceGroup field in struct ValidationError
When unrelated instance groups produce validation errors, the instance group
being updated produces a failure and is forced to wait for rolling update to continue.

This can be avoided as failures in different node instance groups usually don't affect
the instance group being affected in any way.
The tests create a cluster with 2 node instance groups and 1 master and bastion instance groups.
Only one node instance group requires rolling update.

instanceGroupNodeSpecificErrorClusterValidator mocks a validation failure for a given node group.
rolling update should not fail if the cluster validator reports an error in an unrelated instance group.
@bharath-123 bharath-123 force-pushed the feature/instancegroup-specific-validation branch from 4016b36 to 1e18a5d Compare October 31, 2020 13:51
@bharath-123
Copy link
Contributor Author

gentle ping on this @johngmyers . Thank you for your help and patience on this PR.

@johngmyers
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bharath-123, johngmyers

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7b26ec4 into kubernetes:master Nov 6, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/rolling-update cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. hacktoberfest-accepted Accepted contribution for Hacktoberfest lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rolling update ignore unready nodes in inapplicable instance groups
6 participants