
Fix race condition in scale down test #5227

Merged
merged 1 commit into kubernetes:master on Sep 30, 2022

Conversation

yaroslava-serdiuk
Contributor

Which component this PR applies to?

cluster-autoscaler

What this PR does / why we need it:

Fix race condition in unit test

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 30, 2022
@yaroslava-serdiuk
Contributor Author

/assign @x13n

@mwielgus
Contributor

Can you explain what was the race condition and how this PR fixes it?

@yaroslava-serdiuk
Contributor Author

Tests run in parallel and use the same node groups; however, the tests use different setups.

I suppose the race condition happened when we set the cloud provider on the node group. The log from the race condition is the following:

WARNING: DATA RACE
Write at 0x00c000331c08 by goroutine 3188:
  k8s.io/autoscaler/cluster-autoscaler/cloudprovider/test.(*TestNodeGroup).SetCloudProvider()
      /cluster-autoscaler/cloudprovider/test/test_cloud_provider.go:492 +0xa24
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.TestStartDeletionInBatchBasic.func1()
      /cluster-autoscaler/core/scaledown/actuation/actuator_test.go:1046 +0xa0f
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1493 +0x47
 
Previous read at 0x00c000331c08 by goroutine 3187:
  k8s.io/autoscaler/cluster-autoscaler/cloudprovider/test.(*TestNodeGroup).DeleteNodes()
      /cluster-autoscaler/cloudprovider/test/test_cloud_provider.go:402 +0x153
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.deleteNodesFromCloudProvider()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:156 +0x348
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).remove.func1()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:131 +0xad
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).remove.func3()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:142 +0x74
 
Goroutine 3188 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1493 +0x75d
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.TestStartDeletionInBatchBasic()
      /cluster-autoscaler/core/scaledown/actuation/actuator_test.go:1023 +0x1457
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1493 +0x47
 
Goroutine 3187 (running) created at:
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).remove()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:129 +0x571
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).AddNode.func1()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:90 +0x66
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).AddNode.func2()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:91 +0x58

Contributor

@mwielgus left a comment


/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 30, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mwielgus, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2022
@k8s-ci-robot k8s-ci-robot merged commit 9cae42c into kubernetes:master Sep 30, 2022
@x13n
Member

x13n commented Sep 30, 2022

These tests don't run in parallel; the suite doesn't specify t.Parallel() anywhere.

@x13n
Member

x13n commented Sep 30, 2022

The race condition seems to be in TestCloudProvider - setting the cloudProvider field happens in one goroutine, while it is simultaneously being used from another goroutine (NodeDeletionBatcher).
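For illustration, a minimal, hypothetical reduction of that access pattern (not the real TestCloudProvider code): one goroutine is still using the node group while another rewires it, which `go run -race` reports as a data race.

package main

// Hypothetical reduction of the reported pattern; running with -race flags it.
type nodeGroup struct {
	cloudProvider string // stands in for the cloudProvider field in TestNodeGroup
}

func main() {
	ng := &nodeGroup{}
	done := make(chan struct{})
	go func() {
		_ = ng.cloudProvider // read: a deletion goroutine still using the group
		close(done)
	}()
	ng.cloudProvider = "next-scenario" // write: the next scenario re-wires the group
	<-done
}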

@x13n
Member

x13n commented Sep 30, 2022

With this PR merged (and the actual race condition fixed), we could probably add t.Parallel() now, though.
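For reference, a minimal sketch of how table-driven subtests opt into parallel execution (hypothetical scenario names, not the actual actuator_test.go table):

package actuation_test

import "testing"

func TestStartDeletionScenarios(t *testing.T) {
	cases := []struct{ name string }{{name: "scenario A"}, {name: "scenario B"}}
	for _, tc := range cases {
		tc := tc // capture the loop variable for the parallel closure (needed before Go 1.22)
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel() // this subtest now runs alongside other parallel subtests
			_ = tc       // per-scenario setup and assertions would go here
		})
	}
}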

@yaroslava-serdiuk
Contributor Author

@x13n, yes, you are right that t.Parallel() is not added; however, we can add it after this change. I concluded that the tests run in parallel because of the race condition, and I will explain why I think so.

> The race condition seems to be in TestCloudProvider - setting the cloudProvider field happens in one goroutine, while it is simultaneously being used from another goroutine (NodeDeletionBatcher).

I don't agree. Setting the TestCloudProvider happens at the beginning of the loop, and the call to actuator.StartDeletion, which invokes NodeDeletionBatcher, happens later in the loop, so these two calls do not happen simultaneously when they are made from a single test.

My understanding of the failure is the following.
As we can see from the report above, there is a race condition on the node group between setting the cloud provider and calling node deletion. As I explained, I don't see how that could happen within one test. However, it can happen if one test completes successfully but a goroutine started by that test has not finished executing when the following test scenario starts.

This can indeed happen for a test case with a failed node deletion, for example the "Node deletion failed for one group two times" case.
The loop

for i := 0; i < wantDeletedNodes; i++ {
	select {
	case ngId := <-deletedResult:
		gotDeletedNodes[ngId]++
	case <-time.After(1 * time.Second):
		t.Errorf("Timeout while waiting for deleted nodes.")
		break
	}
}

may finish, and with it the test case completes; however, if NodeDeletionBatcher has executed only the successful deletions and still has one or two failed deletions left to execute, the batcher's goroutine keeps running. At that moment we have two test scenarios executing in parallel.
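One way to avoid leaking the batcher goroutine into the next scenario is to wait for every expected result, failed deletions included. A rough sketch with hypothetical channel and counter names (only the general idea, not necessarily what this PR does):

package actuation_test

import (
	"testing"
	"time"
)

// Hypothetical sketch: drain all expected results, failed deletions included,
// so the batcher goroutine from one scenario finishes before the next
// scenario reconfigures the shared node groups.
func waitForDeletionResults(t *testing.T, deleted, failed <-chan string, wantDeleted, wantFailed int) {
	gotDeleted := map[string]int{}
	gotFailed := map[string]int{}
	for i := 0; i < wantDeleted+wantFailed; i++ {
		select {
		case ngId := <-deleted:
			gotDeleted[ngId]++
		case ngId := <-failed:
			gotFailed[ngId]++
		case <-time.After(1 * time.Second):
			t.Errorf("Timeout while waiting for deletion results.")
			return
		}
	}
	// assertions on gotDeleted / gotFailed per node group would follow here
}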

So I think this change does fix the race condition. If you see something wrong in my explanation or disagree with me, I will be happy to discuss it.
I have to admit that the test is not great. In our last discussion with @towca we concluded that we should introduce an additional test cloud provider in order to mock the batching.
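To sketch what mocking the batching could look like (purely hypothetical, not something this PR adds): a fake node group that records each DeleteNodes call synchronously, so a test can assert on the batches directly instead of racing against background goroutines.

package actuation_test

import "sync"

// Purely hypothetical fake: it records which nodes each DeleteNodes call
// received, so batching behaviour can be asserted on without goroutines.
type recordingNodeGroup struct {
	mu      sync.Mutex
	batches [][]string // one entry per DeleteNodes call
}

func (ng *recordingNodeGroup) DeleteNodes(nodes []string) error {
	ng.mu.Lock()
	defer ng.mu.Unlock()
	ng.batches = append(ng.batches, append([]string(nil), nodes...))
	return nil
}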

@x13n
Member

x13n commented Oct 6, 2022

Thanks for the detailed explanation! Yup, this makes sense to me; we shouldn't be seeing this error anymore.

navinjoy pushed a commit to navinjoy/autoscaler that referenced this pull request Oct 26, 2022