Fix bug where a node that becomes ready after 2 #3924

vivekbagade · 2021-03-05T15:54:01Z

Fix bug where a node that becomes ready after 2 mins can be treated as unready. Deprecated LongNotStarted.

In cases where node n1 would:

1) Be created at t=0min
2) Have its Ready condition become true at t=2.5min
3) Have its not-ready taint removed at t=3min

the ready node is counted as unready.

Tested cases after fix:

1) Case described above
2) Nodes not starting even after 15mins are still treated as unready
3) Nodes created long ago that suddenly become unready are counted as unready

vivekbagade · 2021-03-05T15:54:28Z

/assign @MaciekPytel

MaciekPytel · 2021-03-05T16:12:14Z

cluster-autoscaler/clusterstate/clusterstate.go

-		} else if stillStarting := isNodeStillStarting(node); stillStarting && node.CreationTimestamp.Time.Add(MaxNodeStartupTime).Before(currentTime) {
-			current.LongNotStarted++
-		} else if stillStarting {
+		} else if !ready && node.CreationTimestamp.Time.Add(MaxNodeStartupTime).After(currentTime) {


nit: I think with the new logic the order of conditions would be more readable if we handled all unready conditions one after another (like below). The logic is the same and I'm not feeling super strongly about this, it just looks more structured this way.

} else if ready {
	current.Ready++
} else if node.CreationTimestamp.Time.Add(MaxNodeStartupTime).After(currentTime) {
	current.NotStarted++
} else {
	current.Unready++
}

MaciekPytel · 2021-03-05T16:13:16Z

cluster-autoscaler/clusterstate/clusterstate.go

@@ -554,9 +554,7 @@ func (csr *ClusterStateRegistry) updateReadinessStats(currentTime time.Time) {
 		current.Registered++
 		if deletetaint.HasToBeDeletedTaint(node) {
 			current.Deleted++
-		} else if stillStarting := isNodeStillStarting(node); stillStarting && node.CreationTimestamp.Time.Add(MaxNodeStartupTime).Before(currentTime) {
-			current.LongNotStarted++


I think it's better to completely remove LongNotStarted from Readiness if it's unused anyway. It's confusing to keep it in places like upcoming nodes calculation (I know it doesn't change anything since it cannot take value other than 0, but it's likely to surprise someone less familiar with clusterstate).

MaciekPytel · 2021-03-05T16:22:29Z

cluster-autoscaler/clusterstate/clusterstate_test.go

+	err := clusterstate.UpdateNodes([]*apiv1.Node{ng1_1, ng2_1}, nil, now)
+	assert.NoError(t, err)
+	assert.Equal(t, 1, clusterstate.GetClusterReadiness().Unready)
+	assert.Equal(t, 0, clusterstate.GetClusterReadiness().LongNotStarted)


You should assert NotStarted == 0. Non-zero LongNotStarted wouldn't have any practical implications (other than misreporting in configmap and, possibly, metrics). Non-zero NotStarted would result in an unready node being treated as upcoming (as discussed offline).

Also maybe assert other readiness states too and Upcoming == 0 (upcoming == 0 is skirting the definition of a unit test, but it is the most likely negative consequence of a bug in Readiness calculation and so I think worth checking).

Yes, this was a mistake. Fixed.

MaciekPytel · 2021-03-05T16:29:07Z

cluster-autoscaler/clusterstate/clusterstate_test.go

@@ -768,57 +838,6 @@ func TestUpdateScaleUp(t *testing.T) {
 	assert.Nil(t, clusterstate.scaleUpRequests["ng1"])
 }

-func TestIsNodeStillStarting(t *testing.T) {


Not really part of this PR, but I noticed that there is no equivalent test for GetReadinessState (which has very similar logic and you effectively use it as a replacement). Maybe instead of deleting this test cut/paste it as a test for GetReadinessState (minus the recent/long part)?

Added tests

MaciekPytel · 2021-03-05T17:22:11Z

cluster-autoscaler/utils/kubernetes/ready_test.go

+			assert.NoError(t, err)
+			assert.Equal(t, tc.expectedResult, isReady)
+		})
+		t.Run("long "+tc.desc, func(t *testing.T) {


I think that part doesn't make sense, since GetReadinessState() doesn't care about age of the nodes? That's what I meant by "(minus the recent/long part)" in my previous comment (sorry for not being more clear).
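A rough sketch of what "minus the recent/long part" could look like: one table of cases, each run once, with no node age involved. Everything here is illustrative. The `node` type and `isReady` function are hypothetical stand-ins for apiv1.Node and GetReadinessState, whose exact inputs are not shown in this thread, and the sketch is written as a plain program rather than a `testing` test so that it runs on its own.

```go
package main

import "fmt"

// node is a hypothetical, stripped-down stand-in for apiv1.Node.
type node struct {
	readyCondition bool // Ready condition is true
	notReadyTaint  bool // a not-ready taint is still present
}

// isReady is an illustrative readiness check that, like the function under
// review, does not look at node age. The taint rule is an assumption.
func isReady(n node) bool {
	return n.readyCondition && !n.notReadyTaint
}

type testCase struct {
	desc           string
	input          node
	expectedResult bool
}

// testCases lists each scenario once; there is no "recent" or "long"
// variant because the check does not depend on creation time.
func testCases() []testCase {
	return []testCase{
		{"ready condition, no taint", node{readyCondition: true}, true},
		{"ready condition, not-ready taint present", node{readyCondition: true, notReadyTaint: true}, false},
		{"no ready condition", node{}, false},
	}
}

// failedCases returns the descriptions of cases whose result did not match.
func failedCases() []string {
	var failed []string
	for _, tc := range testCases() {
		if isReady(tc.input) != tc.expectedResult {
			failed = append(failed, tc.desc)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", failedCases())
}
```

In a real test file the loop body would become a `t.Run(tc.desc, ...)` call, with the case description used as-is.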

MaciekPytel · 2021-03-11T17:02:30Z

I think MaxStatusSettingDelayAfterCreation is no longer used. Please remove it.

MaciekPytel · 2021-03-11T17:10:55Z

cluster-autoscaler/clusterstate/clusterstate_test.go

+	}, fakeLogRecorder, newBackoff())
+	err := clusterstate.UpdateNodes([]*apiv1.Node{ng1_1, ng2_1}, nil, now)
+	assert.NoError(t, err)
+	assert.Equal(t, 1, clusterstate.GetClusterReadiness().NotStarted)


nit: Below you assert on both notStarted and ready. Maybe do it here too for consistency?

MaciekPytel · 2021-03-11T17:27:25Z

cluster-autoscaler/utils/kubernetes/ready_test.go

+
+			return node
+		}
+		t.Run("recent "+tc.desc, func(t *testing.T) {


nit: remove "recent" from description? it doesn't really apply in this context

MaciekPytel · 2021-03-11T17:28:19Z

/lgtm
/approve
/hold

Left a few nits. Feel free to remove hold after addressing those.

k8s-ci-robot · 2021-03-11T17:28:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, vivekbagade

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [MaciekPytel,vivekbagade]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

MaciekPytel · 2021-03-11T17:34:43Z

/lgtm
/hold cancel

Thanks, that was a tricky one.

kubernetes/autoscaler#3924 changed Cluster Autoscaler behavior to mark nodes as unhealthy only if at least 15m passed since node creation time.

…24-upstream-cluster-autoscaler-release-1.20 Automated cherry pick of #3924: Fix bug where a node that becomes ready after 2 mins can be

…ed to 1.20 in kubernetes#4319 The backport included unit tests using a function that changed signature after 1.20. This was not detected before merging because CI is not running correctly on 1.20.

Cluster Autoscaler: fix unit tests after #3924 was backported to 1.20 in #4319

k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) Mar 5, 2021

k8s-ci-robot requested review from Jeffwan and towca March 5, 2021 15:54

k8s-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels Mar 5, 2021

k8s-ci-robot assigned MaciekPytel Mar 5, 2021

vivekbagade force-pushed the master branch from 120f0b6 to 39e993c Compare March 5, 2021 16:01

MaciekPytel reviewed Mar 5, 2021

vivekbagade force-pushed the master branch from 39e993c to 47039e0 Compare March 5, 2021 17:14

MaciekPytel reviewed Mar 5, 2021

vivekbagade force-pushed the master branch 2 times, most recently from f9faf1c to 5fed757 Compare March 11, 2021 15:08

MaciekPytel reviewed Mar 11, 2021

vivekbagade force-pushed the master branch from 5fed757 to 1f781b9 Compare March 11, 2021 17:19

MaciekPytel reviewed Mar 11, 2021

k8s-ci-robot added the do-not-merge/hold (indicates that a PR should not merge because someone has issued a /hold command) and lgtm ("Looks good to me", indicates that a PR is ready to be merged) labels Mar 11, 2021

vivekbagade force-pushed the master branch from 1f781b9 to 8c592f0 Compare March 11, 2021 17:33

k8s-ci-robot removed the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Mar 11, 2021

k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label and removed the do-not-merge/hold (indicates that a PR should not merge because someone has issued a /hold command) label Mar 11, 2021

k8s-ci-robot merged commit 57be08d into kubernetes:master Mar 11, 2021

x13n added a commit to x13n/kubernetes that referenced this pull request Aug 12, 2021

Increase time to wait for nodes to become unready

05beda5

kubernetes/autoscaler#3924 changed Cluster Autoscaler behavior to mark nodes as unhealthy only if at least 15m passed since node creation time.

x13n mentioned this pull request Aug 12, 2021

Increase time to wait for nodes to become unready kubernetes/kubernetes#104322

Merged

tulsluper mentioned this pull request Sep 3, 2021

Cluster Autoscaler patch releases #4251

Closed

matthias50 mentioned this pull request Sep 9, 2021

Automated cherry pick of #3924: Fix bug where a node that becomes ready after 2 mins can be #4319

Merged

k8s-ci-robot added a commit that referenced this pull request Sep 30, 2021

Merge pull request #4365 from towca/jtuznik/4319-fix

bb614da

Cluster Autoscaler: fix unit tests after #3924 was backported to 1.20 in #4319
