
Fix race condition in scale down test #5227

Merged
merged 1 commit into kubernetes:master on Sep 30, 2022

Conversation

yaroslava-serdiuk
Contributor

Which component this PR applies to?

cluster-autoscaler

What this PR does / why we need it:

Fix race condition in unit test

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 30, 2022
@yaroslava-serdiuk
Contributor Author

/assign @x13n

@mwielgus
Contributor

Can you explain what was the race condition and how this PR fixes it?

@yaroslava-serdiuk
Contributor Author

Tests run in parallel and use the same node groups; however, the tests use different setups.

I suppose the race condition happened when we set the cloud provider on the node group. The log from the race condition is the following:

WARNING: DATA RACE
Write at 0x00c000331c08 by goroutine 3188:
  k8s.io/autoscaler/cluster-autoscaler/cloudprovider/test.(*TestNodeGroup).SetCloudProvider()
      /cluster-autoscaler/cloudprovider/test/test_cloud_provider.go:492 +0xa24
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.TestStartDeletionInBatchBasic.func1()
      /cluster-autoscaler/core/scaledown/actuation/actuator_test.go:1046 +0xa0f
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1493 +0x47
 
Previous read at 0x00c000331c08 by goroutine 3187:
  k8s.io/autoscaler/cluster-autoscaler/cloudprovider/test.(*TestNodeGroup).DeleteNodes()
      /cluster-autoscaler/cloudprovider/test/test_cloud_provider.go:402 +0x153
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.deleteNodesFromCloudProvider()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:156 +0x348
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).remove.func1()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:131 +0xad
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).remove.func3()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:142 +0x74
 
Goroutine 3188 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1493 +0x75d
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.TestStartDeletionInBatchBasic()
      /cluster-autoscaler/core/scaledown/actuation/actuator_test.go:1023 +0x1457
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1446 +0x216
  testing.(*T).Run.func1()
      /usr/local/go/src/testing/testing.go:1493 +0x47
 
Goroutine 3187 (running) created at:
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).remove()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:129 +0x571
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).AddNode.func1()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:90 +0x66
  k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).AddNode.func2()
      /cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:91 +0x58

Contributor

@mwielgus left a comment


/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 30, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mwielgus, yaroslava-serdiuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2022
@k8s-ci-robot k8s-ci-robot merged commit 9cae42c into kubernetes:master Sep 30, 2022
@x13n
Member

x13n commented Sep 30, 2022

These tests don't run in parallel; the suite doesn't specify t.Parallel() anywhere.

@x13n
Member

x13n commented Sep 30, 2022

The race condition seems to be in TestCloudProvider - setting the cloudProvider field happens in one goroutine, while it is simultaneously being used from another goroutine (NodeDeletionBatcher).
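For illustration, a minimal, hypothetical reduction of that access pattern (not the real TestCloudProvider code): one goroutine is still using the node group while another rewires it, which `go run -race` reports as a data race.

package main

// Hypothetical reduction of the reported pattern; running with -race flags it.
type nodeGroup struct {
	cloudProvider string // stands in for the cloudProvider field in TestNodeGroup
}

func main() {
	ng := &nodeGroup{}
	done := make(chan struct{})
	go func() {
		_ = ng.cloudProvider // read: a deletion goroutine still using the group
		close(done)
	}()
	ng.cloudProvider = "next-scenario" // write: the next scenario re-wires the group
	<-done
}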

@x13n
Member

x13n commented Sep 30, 2022

With this PR merged (and the actual race condition fixed), we could probably add t.Parallel() now, though.
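For reference, a minimal sketch of how table-driven subtests opt into parallel execution (hypothetical scenario names, not the actual actuator_test.go table):

package actuation_test

import "testing"

func TestStartDeletionScenarios(t *testing.T) {
	cases := []struct{ name string }{{name: "scenario A"}, {name: "scenario B"}}
	for _, tc := range cases {
		tc := tc // capture the loop variable for the parallel closure (needed before Go 1.22)
		t.Run(tc.name, func(t *testing.T) {
			t.Parallel() // this subtest now runs alongside other parallel subtests
			_ = tc       // per-scenario setup and assertions would go here
		})
	}
}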

@yaroslava-serdiuk
Contributor Author

@x13n, yes, you are right that t.Parallel() is not added; however, we can add it after this change. I concluded that the tests run in parallel because of the race condition, and I will explain why I think so.

> The race condition seems to be in TestCloudProvider - setting the cloudProvider field happens in one goroutine, while it is simultaneously being used from another goroutine (NodeDeletionBatcher).

I don't agree. Setting the TestCloudProvider happens at the beginning of the loop, and the call to actuator.StartDeletion, which invokes NodeDeletionBatcher, happens later in the loop, so these two calls do not happen simultaneously when they are made from a single test.

My understanding of the failure is the following.
As we can see from the report above, there is a race condition on the node group between setting the cloud provider and calling node deletion. As I explained, I don't see how that could happen within one test. However, it can happen if one test completes successfully but a goroutine started by that test has not finished executing when the following test scenario starts.

This can indeed happen for a test case with a failed node deletion, for example the "Node deletion failed for one group two times" case.
The loop

for i := 0; i < wantDeletedNodes; i++ {
	select {
	case ngId := <-deletedResult:
		gotDeletedNodes[ngId]++
	case <-time.After(1 * time.Second):
		t.Errorf("Timeout while waiting for deleted nodes.")
		break
	}
}

may finish, and with it the test case completes; however, if NodeDeletionBatcher has executed only the successful deletions and still has one or two failed deletions left to execute, the batcher's goroutine keeps running. At that moment we have two test scenarios executing in parallel.
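One way to avoid leaking the batcher goroutine into the next scenario is to wait for every expected result, failed deletions included. A rough sketch with hypothetical channel and counter names (only the general idea, not necessarily what this PR does):

package actuation_test

import (
	"testing"
	"time"
)

// Hypothetical sketch: drain all expected results, failed deletions included,
// so the batcher goroutine from one scenario finishes before the next
// scenario reconfigures the shared node groups.
func waitForDeletionResults(t *testing.T, deleted, failed <-chan string, wantDeleted, wantFailed int) {
	gotDeleted := map[string]int{}
	gotFailed := map[string]int{}
	for i := 0; i < wantDeleted+wantFailed; i++ {
		select {
		case ngId := <-deleted:
			gotDeleted[ngId]++
		case ngId := <-failed:
			gotFailed[ngId]++
		case <-time.After(1 * time.Second):
			t.Errorf("Timeout while waiting for deletion results.")
			return
		}
	}
	// assertions on gotDeleted / gotFailed per node group would follow here
}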

So I think this change does fix the race condition. If you see something wrong in my explanation or disagree with me, I will be happy to discuss it.
I have to admit that the test is not great. In our last discussion with @towca we concluded that we should introduce an additional test cloud provider in order to mock the batching.
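To sketch what mocking the batching could look like (purely hypothetical, not something this PR adds): a fake node group that records each DeleteNodes call synchronously, so a test can assert on the batches directly instead of racing against background goroutines.

package actuation_test

import "sync"

// Purely hypothetical fake: it records which nodes each DeleteNodes call
// received, so batching behaviour can be asserted on without goroutines.
type recordingNodeGroup struct {
	mu      sync.Mutex
	batches [][]string // one entry per DeleteNodes call
}

func (ng *recordingNodeGroup) DeleteNodes(nodes []string) error {
	ng.mu.Lock()
	defer ng.mu.Unlock()
	ng.batches = append(ng.batches, append([]string(nil), nodes...))
	return nil
}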

@x13n
Member

x13n commented Oct 6, 2022

Thanks for the detailed explanation! Yup, this makes sense to me; we shouldn't be seeing this error anymore.

navinjoy pushed a commit to navinjoy/autoscaler that referenced this pull request Oct 26, 2022