
Introduce NodeDeleterBatcher to ScaleDown actuator #5060

Merged · 1 commit · Sep 22, 2022

Conversation

@yaroslava-serdiuk (Contributor) commented Jul 28, 2022

Which component this PR applies to?

/cluster-autoscaler

What type of PR is this?

/kind feature

What this PR does / why we need it:

Optimize scale down by deleting nodes in batches. The --node-deletion-in-batch-interval flag represents how long the NodeDeleterBatcher gathers nodes for one node group before calling nodeGroup.DeleteNodes(nodes). If the flag is unset, the NodeDeleterBatcher won't wait for other nodes and will start node deletion immediately.

ScaleDown in batch: the --node-deletion-in-batch-interval flag determines how long CA ScaleDown gathers nodes before deleting them in a batch.
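
For example, to let the batcher gather nodes for up to five seconds before each batched call (illustrative value; the flag name follows this PR's description):

    cluster-autoscaler --node-deletion-in-batch-interval=5s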

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/feature and size/XL labels on Jul 28, 2022
@k8s-ci-robot added the cncf-cla: yes label on Jul 28, 2022
@yaroslava-serdiuk force-pushed the deleting-in-batch branch 2 times, most recently from f1bed8d to 3ae4ab9 on July 29, 2022 07:52
Resolved review threads (outdated):
  • cluster-autoscaler/config/autoscaling_options.go
  • cluster-autoscaler/core/scaledown/actuation/actuator.go (3 threads)
@yaroslava-serdiuk force-pushed the deleting-in-batch branch 4 times, most recently from 54efed5 to ebca4f9 on August 29, 2022 13:52
@yaroslava-serdiuk (Contributor, Author)

/hold

@k8s-ci-robot added the do-not-merge/hold label on Aug 30, 2022
@yaroslava-serdiuk (Contributor, Author)

/unhold

I updated the code a bit:

  • Added the drained nodes map directly to the batcher, so I removed the methods I had added to nodeDeleterTracker.
  • Modified the batcher's AddNode() method: previously it returned nodeGroupId only if the node was the first for the current node group; now it always returns nodeGroupId, together with an additional bool indicating whether the node is the first in the current batch (see the sketch below).
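
A minimal sketch of the described AddNode() behaviour, assuming a heavily simplified batcher (the map and method names follow this review; the real struct and signatures differ):

    package actuation

    import (
        "sync"

        apiv1 "k8s.io/api/core/v1"
    )

    // NodeDeletionBatcher gathers nodes per node group so they can be deleted
    // together. Illustration only, not the merged implementation.
    type NodeDeletionBatcher struct {
        sync.Mutex
        nodeGroupForNode      func(*apiv1.Node) string // stand-in for CloudProvider.NodeGroupForNode
        deletionsPerNodeGroup map[string][]*apiv1.Node
        drainedNodeDeletions  map[string]bool
    }

    // AddNode always returns the node group id, plus a bool that is true when
    // the node opens a new batch for its group; the caller uses that flag to
    // schedule the timed flush for the group.
    func (b *NodeDeletionBatcher) AddNode(node *apiv1.Node, drain bool) (nodeGroupId string, first bool) {
        b.Lock()
        defer b.Unlock()
        nodeGroupId = b.nodeGroupForNode(node)
        b.drainedNodeDeletions[node.Name] = drain
        first = len(b.deletionsPerNodeGroup[nodeGroupId]) == 0
        b.deletionsPerNodeGroup[nodeGroupId] = append(b.deletionsPerNodeGroup[nodeGroupId], node)
        return nodeGroupId, first
    }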

@k8s-ci-robot removed the do-not-merge/hold label on Sep 1, 2022
Resolved review threads (outdated):
  • cluster-autoscaler/main.go
  • cluster-autoscaler/core/scaledown/actuation/actuator.go (2 threads)
RecordFailedScaleDownEvent(node, drain, a.ctx.Recorder, "prepareNodeForDelition failed", status.Err)
_, _ = deletetaint.CleanToBeDeleted(node, a.ctx.ClientSet, a.ctx.CordonNodeBeforeTerminate)
nodeGroup, err := a.ctx.CloudProvider.NodeGroupForNode(node)
if err != nil {
Collaborator

What do you think about just passing the node group id to scheduleDeletion and prepareForDeletion? We have the id ready whenever we're calling scheduleDeletion anyway. This would mean we wouldn't have to call NodeGroupForNode here and in prepareForDeletion separately, and we would be able to handle the error better here.

Contributor Author

Actually, we will call NodeGroupForNode() when the actual node deletion happens, so if NodeGroupForNode() returns nil I think we may get an error during deletion, and probably inside the batcher as well.
For scheduleDeletion() I would pass nodeGroupId, but prepareNodeForDeletion doesn't need nodeGroupId, so I would leave it as is to keep an intermediate NodeGroupForNode() check.

toBeDeletedTaint := apiv1.Taint{Key: deletetaint.ToBeDeletedTaint, Effect: apiv1.TaintEffectNoSchedule}
testNg := testprovider.NewTestNodeGroup("test-ng", 0, 100, 3, true, false, "n1-standard-2", nil, nil)
emptyNodeNodeGroup := generateNodesAndNodeGroupMap(4, "empty")
Collaborator

Why is it important to switch from one node group for all nodes to one node group per node? Just to test a more comprehensive scenario, since node groups actually matter now, or is there another reason?

Contributor Author

In the failed-deletion scenario for scale down with the batcher, we get one error message for all nodes from a node group. Also, if one deletion fails in one node group, it may result in failed deletions in another node group.
However, here we have a batch with a 0-second batching interval, so the nodes should be deleted one by one. That is not always true, though: since we lock the batcher in addNodeToBucket() and remove(), we may actually add a few nodes before the remove() call.
This is what happens from time to time, and it makes the current implementation of the test flaky. I fixed the test so that it isn't flaky, but now I'm not sure it's the right approach. I would like to keep the old behaviour (i.e. no batching) when the batch interval is 0 seconds.
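
To illustrate the locking interplay described here, extending the simplified batcher sketch above (method names follow the discussion; the bodies are assumed):

    // addNodeToBucket and remove contend for the same lock, so even with a
    // 0-second interval several nodes can be added to a bucket before
    // remove() acquires the lock and flushes them all in a single call.
    func (b *NodeDeletionBatcher) addNodeToBucket(nodeGroupId string, node *apiv1.Node) {
        b.Lock()
        defer b.Unlock()
        b.deletionsPerNodeGroup[nodeGroupId] = append(b.deletionsPerNodeGroup[nodeGroupId], node)
    }

    func (b *NodeDeletionBatcher) remove(nodeGroupId string) []*apiv1.Node {
        b.Lock()
        defer b.Unlock()
        nodes := b.deletionsPerNodeGroup[nodeGroupId]
        delete(b.deletionsPerNodeGroup, nodeGroupId)
        for _, n := range nodes {
            delete(b.drainedNodeDeletions, n.Name)
        }
        return nodes // may include nodes that slipped in before the lock was taken
    }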

break
}
}
if diff := cmp.Diff(test.wantSuccessfulDeletion, gotDeletedNodes); diff != "" {
Collaborator

What does this test verify that the test above doesn't? I'd imagine we want to test if the nodes actually get batched, but this test doesn't verify that at all, since the cloud provider hook processes nodes one-by-one. I'd try to assert if all nodes from one wave get deleted in the same API call. That will probably require some fiddling around with the cloud provider, but IMO it's needed.

Contributor Author

The waves show the sets of nodes for which we call actuator.StartDeletion, but only nodes that belong to one nodeGroup are deleted in one API call.
The previous test exercises the batcher with a 0-second interval, while this test exercises it with a >0-second interval, so this test covers the e2e deletion. The batcher tests themselves are aimed at verifying the correctness of the batching behaviour.

Collaborator

The whole reason we're doing batching is to delete multiple nodes in one API call. This test doesn't verify that, it just verifies that all nodes do get deleted. It would still pass without the batcher change. To test this properly, you'd have to modify the test cloud provider (or create a custom one), that would hook either on GkeMig.DeleteNodes(nodes []*apiv1.Node), or gkeManager.DeleteInstances(instances []gce.GceRef), which would allow us to verify if the nodes got deleted in 1 call. It's a big change though and this review has already gone for long enough, so let's discuss it in a follow-up.
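
A rough sketch of that verification idea (hypothetical test helper, not the existing testprovider API):

    // recordingNodeDeleter captures the size of every DeleteNodes call so a
    // test can assert that all nodes of one wave landed in a single API call.
    type recordingNodeDeleter struct {
        sync.Mutex
        batchSizes []int
    }

    func (r *recordingNodeDeleter) DeleteNodes(nodes []*apiv1.Node) error {
        r.Lock()
        defer r.Unlock()
        r.batchSizes = append(r.batchSizes, len(nodes))
        return nil
    }

    // After StartDeletion, a test could then assert, e.g.:
    //   if r.batchSizes[0] != len(wave) { t.Errorf("expected one batched call, got %v", r.batchSizes) }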


func TestCleanUp(t *testing.T) {
provider := testprovider.NewTestCloudProvider(nil, func(id, nodeName string) error {
return nil
Collaborator

Don't we also want to verify that the nodes actually get deleted, and in batch?

Contributor Author

I added a gotDeletedNodes slice.

}
batches := len(d.deletionsPerNodeGroup)

// CleanUp NodeGroup that is not present in bucket.
Collaborator

Why not make this a table-style test as well? I'd like to see a test case where we do addNode, addNode, remove, addNode, remove for the same node group in quick succession (or maybe all but the last remove started at the same time as goroutines, and then a synchronous remove). Then assert that all nodes get deleted, and that the state is as expected at the end (empty deletionsPerNodeGroup, empty drainedNodeDeletions). I know it should work with the current implementation, but I can see somebody changing something in the future that would make us skip some node in this scenario, or leak memory in the structures.
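
A sketch of the suggested stress test, reusing the simplified batcher from the sketches above (hypothetical, not the merged test):

    // Assumed imports: fmt, sync, testing,
    // apiv1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1".
    func TestConcurrentAddAndRemove(t *testing.T) {
        b := &NodeDeletionBatcher{
            nodeGroupForNode:      func(*apiv1.Node) string { return "ng" },
            deletionsPerNodeGroup: map[string][]*apiv1.Node{},
            drainedNodeDeletions:  map[string]bool{},
        }
        var wg sync.WaitGroup
        for i := 0; i < 5; i++ {
            node := &apiv1.Node{ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("node-%d", i)}}
            wg.Add(2)
            go func(n *apiv1.Node) { defer wg.Done(); b.AddNode(n, false) }(node)
            go func() { defer wg.Done(); b.remove("ng") }()
        }
        wg.Wait()
        b.remove("ng") // final synchronous remove flushes anything left behind
        if len(b.deletionsPerNodeGroup) != 0 || len(b.drainedNodeDeletions) != 0 {
            t.Errorf("batcher state should be empty after the final remove")
        }
    }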

Contributor Author

The test is now table-style. This is a unit test that verifies the remove() call only; the whole batch path, i.e. AddNode(), should be tested in actuator_test, which covers the full functionality.

@yaroslava-serdiuk force-pushed the deleting-in-batch branch 6 times, most recently from fec0a90 to d5fd28c on September 21, 2022 14:34
}
}(drainNode)
nodeGroup, err := a.ctx.CloudProvider.NodeGroupForNode(drainNode)
if err != nil {
Collaborator

We should definitely check nodeGroup for nil here, if we're calling .Id() on it right after this, otherwise we're risking panic. And if that check is here already, I suppose it serves no additional purpose in prepareNodeForDeletion, right?

Contributor Author

Right
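
The resulting guard, sketched as a continuation of the hunk quoted above (error handling elided; the flow-control choice is an assumption):

    nodeGroup, err := a.ctx.CloudProvider.NodeGroupForNode(drainNode)
    if err != nil || nodeGroup == nil {
        // Without the nil check, the nodeGroup.Id() call that follows would
        // panic when the node no longer maps to any node group.
        return // hypothetical handling: record the failure and stop for this node
    }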


t.Errorf("%s: remove() return error, but shouldn't", test.name)
}
gotDeletedNodes := []string{}
for i := 0; i < test.numNodes; i++ {
Collaborator

Shouldn't this loop also select on notDeletedNodes? E.g. in the unsuccessful remove case test.numNodes = 5, but deletedNodes will only receive 4 confirmations, since 1 node notifies a different channel. Wouldn't this run into the timeout case?

Contributor Author

Right, the unsuccessful remove case wasn't running because of a return statement; that's fixed now.
If a deletion fails, the subsequent nodes are not deleted, so I check the deleted nodes only in the success case, and otherwise I check that there is at least one failure.
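
The loop shape under discussion, sketched with assumed channel and variable names:

    for i := 0; i < test.numNodes; i++ {
        select {
        case name := <-deletedNodes:
            gotDeletedNodes = append(gotDeletedNodes, name)
        case name := <-notDeletedNodes:
            gotFailedDeletions = append(gotFailedDeletions, name)
        case <-time.After(3 * time.Second):
            t.Fatalf("timed out waiting for deletion results")
        }
    }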

@towca (Collaborator) commented Sep 22, 2022

/lgtm
/approve
/hold

Thanks for all the changes!! Just one more nit, feel free to unhold if you don't agree or prefer to address it in a follow-up.

@k8s-ci-robot added the do-not-merge/hold label on Sep 22, 2022
@k8s-ci-robot added the lgtm label on Sep 22, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Sep 22, 2022
@towca (Collaborator) commented Sep 22, 2022

/unhold

@k8s-ci-robot removed the do-not-merge/hold label on Sep 22, 2022
@k8s-ci-robot merged commit b3c6b60 into kubernetes:master on Sep 22, 2022
navinjoy pushed a commit to navinjoy/autoscaler referencing this pull request on Oct 26, 2022: Introduce NodeDeleterBatcher to ScaleDown actuator