
Introduce NodeDeleterBatcher to ScaleDown actuator #5060

Merged · 1 commit · Sep 22, 2022

Conversation

@yaroslava-serdiuk (Contributor) commented Jul 28, 2022

Which component this PR applies to?

/cluster-autoscaler

What type of PR is this?

/kind feature

What this PR does / why we need it:

Optimize scale down by deleting nodes in batches. The --node-deletion-in-batch-interval flag represents how long the NodeDeleterBatcher gathers nodes for one node group before calling nodeGroup.DeleteNodes(nodes). If the flag is unset, the NodeDeleterBatcher won't wait for other nodes and will start node deletion immediately.

ScaleDown in batch: the --node-deletion-in-batch-interval flag determines how long CA ScaleDown gathers nodes before deleting them in a batch.
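
For example, to let the batcher gather nodes for up to five seconds before each batched call (illustrative value; the flag name follows this PR's description):

    cluster-autoscaler --node-deletion-in-batch-interval=5s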

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/feature and size/XL labels on Jul 28, 2022
@k8s-ci-robot added the cncf-cla: yes label on Jul 28, 2022
@yaroslava-serdiuk force-pushed the deleting-in-batch branch 2 times, most recently from f1bed8d to 3ae4ab9 on July 29, 2022 07:52
Resolved review threads (outdated):
  • cluster-autoscaler/config/autoscaling_options.go
  • cluster-autoscaler/core/scaledown/actuation/actuator.go (3 threads)
@yaroslava-serdiuk force-pushed the deleting-in-batch branch 4 times, most recently from 54efed5 to ebca4f9 on August 29, 2022 13:52
@yaroslava-serdiuk (Contributor, Author)

/hold

@k8s-ci-robot added the do-not-merge/hold label on Aug 30, 2022
@yaroslava-serdiuk (Contributor, Author)

/unhold

I updated the code a bit:

  • Added the drained nodes map directly to the batcher, so I removed the methods I had added to nodeDeleterTracker.
  • Modified the batcher's AddNode() method: previously it returned nodeGroupId only if the node was the first for the current node group; now it always returns nodeGroupId, together with an additional bool indicating whether the node is the first in the current batch (see the sketch below).
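
A minimal sketch of the described AddNode() behaviour, assuming a heavily simplified batcher (the map and method names follow this review; the real struct and signatures differ):

    package actuation

    import (
        "sync"

        apiv1 "k8s.io/api/core/v1"
    )

    // NodeDeletionBatcher gathers nodes per node group so they can be deleted
    // together. Illustration only, not the merged implementation.
    type NodeDeletionBatcher struct {
        sync.Mutex
        nodeGroupForNode      func(*apiv1.Node) string // stand-in for CloudProvider.NodeGroupForNode
        deletionsPerNodeGroup map[string][]*apiv1.Node
        drainedNodeDeletions  map[string]bool
    }

    // AddNode always returns the node group id, plus a bool that is true when
    // the node opens a new batch for its group; the caller uses that flag to
    // schedule the timed flush for the group.
    func (b *NodeDeletionBatcher) AddNode(node *apiv1.Node, drain bool) (nodeGroupId string, first bool) {
        b.Lock()
        defer b.Unlock()
        nodeGroupId = b.nodeGroupForNode(node)
        b.drainedNodeDeletions[node.Name] = drain
        first = len(b.deletionsPerNodeGroup[nodeGroupId]) == 0
        b.deletionsPerNodeGroup[nodeGroupId] = append(b.deletionsPerNodeGroup[nodeGroupId], node)
        return nodeGroupId, first
    }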

@k8s-ci-robot removed the do-not-merge/hold label on Sep 1, 2022
Resolved review threads (outdated):
  • cluster-autoscaler/main.go
  • cluster-autoscaler/core/scaledown/actuation/actuator.go (2 threads)
RecordFailedScaleDownEvent(node, drain, a.ctx.Recorder, "prepareNodeForDelition failed", status.Err)
_, _ = deletetaint.CleanToBeDeleted(node, a.ctx.ClientSet, a.ctx.CordonNodeBeforeTerminate)
nodeGroup, err := a.ctx.CloudProvider.NodeGroupForNode(node)
if err != nil {
Collaborator

What do you think about just passing the node group id to scheduleDeletion and prepareForDeletion? We have the id ready whenever we're calling scheduleDeletion anyway. This would mean we wouldn't have to call NodeGroupForNode here and in prepareForDeletion separately, and we would be able to handle the error better here.

Contributor Author

Actually, we will call NodeGroupForNode() when the actual node deletion happens, so if NodeGroupForNode() returns nil I think we may get an error during deletion, and probably inside the batcher as well.
For scheduleDeletion() I would pass nodeGroupId, but prepareNodeForDeletion doesn't need nodeGroupId, so I would leave it as is to keep an intermediate NodeGroupForNode() check.

toBeDeletedTaint := apiv1.Taint{Key: deletetaint.ToBeDeletedTaint, Effect: apiv1.TaintEffectNoSchedule}
testNg := testprovider.NewTestNodeGroup("test-ng", 0, 100, 3, true, false, "n1-standard-2", nil, nil)
emptyNodeNodeGroup := generateNodesAndNodeGroupMap(4, "empty")
Collaborator

Why is it important to switch from one node group for all nodes to one node group per node? Just to test a more comprehensive scenario, since node groups actually matter now, or is there another reason?

Contributor Author

In the failed-deletion scenario for scale down with the batcher, we get one error message for all nodes from a node group. Also, if one deletion fails in one node group, it may result in failed deletions in another node group.
However, here we have a batch with a 0-second batching interval, so the nodes should be deleted one by one. That is not always true, though: since we lock the batcher in addNodeToBucket() and remove(), we may actually add a few nodes before the remove() call.
This is what happens from time to time, and it makes the current implementation of the test flaky. I fixed the test so that it isn't flaky, but now I'm not sure it's the right approach. I would like to keep the old behaviour (i.e. no batching) when the batch interval is 0 seconds.
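
To illustrate the locking interplay described here, extending the simplified batcher sketch above (method names follow the discussion; the bodies are assumed):

    // addNodeToBucket and remove contend for the same lock, so even with a
    // 0-second interval several nodes can be added to a bucket before
    // remove() acquires the lock and flushes them all in a single call.
    func (b *NodeDeletionBatcher) addNodeToBucket(nodeGroupId string, node *apiv1.Node) {
        b.Lock()
        defer b.Unlock()
        b.deletionsPerNodeGroup[nodeGroupId] = append(b.deletionsPerNodeGroup[nodeGroupId], node)
    }

    func (b *NodeDeletionBatcher) remove(nodeGroupId string) []*apiv1.Node {
        b.Lock()
        defer b.Unlock()
        nodes := b.deletionsPerNodeGroup[nodeGroupId]
        delete(b.deletionsPerNodeGroup, nodeGroupId)
        for _, n := range nodes {
            delete(b.drainedNodeDeletions, n.Name)
        }
        return nodes // may include nodes that slipped in before the lock was taken
    }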

break
}
}
if diff := cmp.Diff(test.wantSuccessfulDeletion, gotDeletedNodes); diff != "" {
Collaborator

What does this test verify that the test above doesn't? I'd imagine we want to test if the nodes actually get batched, but this test doesn't verify that at all, since the cloud provider hook processes nodes one-by-one. I'd try to assert if all nodes from one wave get deleted in the same API call. That will probably require some fiddling around with the cloud provider, but IMO it's needed.

Contributor Author

The waves show the sets of nodes for which we call actuator.StartDeletion, but only nodes that belong to one nodeGroup are deleted in one API call.
The previous test exercises the batcher with a 0-second interval, while this test exercises it with a >0-second interval, so this test covers the e2e deletion. The batcher tests themselves are aimed at verifying the correctness of the batching behaviour.

Collaborator

The whole reason we're doing batching is to delete multiple nodes in one API call. This test doesn't verify that, it just verifies that all nodes do get deleted. It would still pass without the batcher change. To test this properly, you'd have to modify the test cloud provider (or create a custom one), that would hook either on GkeMig.DeleteNodes(nodes []*apiv1.Node), or gkeManager.DeleteInstances(instances []gce.GceRef), which would allow us to verify if the nodes got deleted in 1 call. It's a big change though and this review has already gone for long enough, so let's discuss it in a follow-up.
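
A rough sketch of that verification idea (hypothetical test helper, not the existing testprovider API):

    // recordingNodeDeleter captures the size of every DeleteNodes call so a
    // test can assert that all nodes of one wave landed in a single API call.
    type recordingNodeDeleter struct {
        sync.Mutex
        batchSizes []int
    }

    func (r *recordingNodeDeleter) DeleteNodes(nodes []*apiv1.Node) error {
        r.Lock()
        defer r.Unlock()
        r.batchSizes = append(r.batchSizes, len(nodes))
        return nil
    }

    // After StartDeletion, a test could then assert, e.g.:
    //   if r.batchSizes[0] != len(wave) { t.Errorf("expected one batched call, got %v", r.batchSizes) }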


func TestCleanUp(t *testing.T) {
provider := testprovider.NewTestCloudProvider(nil, func(id, nodeName string) error {
return nil
Collaborator

Don't we also want to verify that the nodes actually get deleted, and in batch?

Contributor Author

I added a gotDeletedNodes slice.

}
batches := len(d.deletionsPerNodeGroup)

// CleanUp NodeGroup that is not present in bucket.
Collaborator

Why not make this a table-style test as well? I'd like to see a test case where we do addNode, addNode, remove, addNode, remove for the same node group in quick succession (or maybe all but the last remove started at the same time as goroutines, and then a synchronous remove). Then assert that all nodes get deleted, and that the state is as expected at the end (empty deletionsPerNodeGroup, empty drainedNodeDeletions). I know it should work with the current implementation, but I can see somebody changing something in the future that would make us skip some node in this scenario, or leak memory in the structures.
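
A sketch of the suggested stress test, reusing the simplified batcher from the sketches above (hypothetical, not the merged test):

    // Assumed imports: fmt, sync, testing,
    // apiv1 "k8s.io/api/core/v1", metav1 "k8s.io/apimachinery/pkg/apis/meta/v1".
    func TestConcurrentAddAndRemove(t *testing.T) {
        b := &NodeDeletionBatcher{
            nodeGroupForNode:      func(*apiv1.Node) string { return "ng" },
            deletionsPerNodeGroup: map[string][]*apiv1.Node{},
            drainedNodeDeletions:  map[string]bool{},
        }
        var wg sync.WaitGroup
        for i := 0; i < 5; i++ {
            node := &apiv1.Node{ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("node-%d", i)}}
            wg.Add(2)
            go func(n *apiv1.Node) { defer wg.Done(); b.AddNode(n, false) }(node)
            go func() { defer wg.Done(); b.remove("ng") }()
        }
        wg.Wait()
        b.remove("ng") // final synchronous remove flushes anything left behind
        if len(b.deletionsPerNodeGroup) != 0 || len(b.drainedNodeDeletions) != 0 {
            t.Errorf("batcher state should be empty after the final remove")
        }
    }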

Contributor Author

The test is now table-style. This is a unit test that verifies the remove() call only; the whole batch path, i.e. AddNode(), should be tested in actuator_test, which covers the full functionality.

@yaroslava-serdiuk force-pushed the deleting-in-batch branch 6 times, most recently from fec0a90 to d5fd28c on September 21, 2022 14:34
}
}(drainNode)
nodeGroup, err := a.ctx.CloudProvider.NodeGroupForNode(drainNode)
if err != nil {
Collaborator

We should definitely check nodeGroup for nil here, if we're calling .Id() on it right after this, otherwise we're risking panic. And if that check is here already, I suppose it serves no additional purpose in prepareNodeForDeletion, right?

Contributor Author

Right
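
The resulting guard, sketched as a continuation of the hunk quoted above (error handling elided; the flow-control choice is an assumption):

    nodeGroup, err := a.ctx.CloudProvider.NodeGroupForNode(drainNode)
    if err != nil || nodeGroup == nil {
        // Without the nil check, the nodeGroup.Id() call that follows would
        // panic when the node no longer maps to any node group.
        return // hypothetical handling: record the failure and stop for this node
    }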


t.Errorf("%s: remove() return error, but shouldn't", test.name)
}
gotDeletedNodes := []string{}
for i := 0; i < test.numNodes; i++ {
Collaborator

Shouldn't this loop also select on notDeletedNodes? E.g. in the unsuccessful remove case test.numNodes = 5, but deletedNodes will only receive 4 confirmations, since 1 node notifies a different channel. Wouldn't this run into the timeout case?

Contributor Author

Right, the unsuccessful remove case wasn't running because of a return statement; that's fixed now.
If a deletion fails, the subsequent nodes are not deleted, so I check the deleted nodes only in the success case, and otherwise I check that there is at least one failure.
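
The loop shape under discussion, sketched with assumed channel and variable names:

    for i := 0; i < test.numNodes; i++ {
        select {
        case name := <-deletedNodes:
            gotDeletedNodes = append(gotDeletedNodes, name)
        case name := <-notDeletedNodes:
            gotFailedDeletions = append(gotFailedDeletions, name)
        case <-time.After(3 * time.Second):
            t.Fatalf("timed out waiting for deletion results")
        }
    }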

@towca (Collaborator) commented Sep 22, 2022

/lgtm
/approve
/hold

Thanks for all the changes!! Just one more nit, feel free to unhold if you don't agree or prefer to address it in a follow-up.

@k8s-ci-robot added the do-not-merge/hold label on Sep 22, 2022
@k8s-ci-robot added the lgtm label on Sep 22, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Sep 22, 2022
@towca (Collaborator) commented Sep 22, 2022

/unhold

@k8s-ci-robot removed the do-not-merge/hold label on Sep 22, 2022
@k8s-ci-robot merged commit b3c6b60 into kubernetes:master on Sep 22, 2022
navinjoy pushed a commit to navinjoy/autoscaler referencing this pull request on Oct 26, 2022: Introduce NodeDeleterBatcher to ScaleDown actuator