
roachtest: fix TestVMPreemptionPolling data race #135312

Merged (1 commit) on Nov 19, 2024

Conversation

DarrylWong
Contributor

@DarrylWong DarrylWong commented Nov 15, 2024

This change switches pollPreemptionInterval to a mutex-protected struct, as multiple unit tests modify it, which can otherwise lead to a data race.
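
For a sense of the shape of the change, here is a minimal sketch, assuming the hook stays a package-level variable; the accessor names are illustrative, not necessarily what the patch uses:

package main

import (
	"sync"
	"time"
)

// pollPreemptionInterval guards the test-only override behind a mutex so the
// monitor goroutine (reader) and unit tests (writers) cannot race.
var pollPreemptionInterval = struct {
	sync.Mutex
	interval time.Duration
}{}

// setPollPreemptionInterval is what a unit test would call to override the interval.
func setPollPreemptionInterval(d time.Duration) {
	pollPreemptionInterval.Lock()
	defer pollPreemptionInterval.Unlock()
	pollPreemptionInterval.interval = d
}

// getPollPreemptionInterval is what the monitor would call to read it.
func getPollPreemptionInterval() time.Duration {
	pollPreemptionInterval.Lock()
	defer pollPreemptionInterval.Unlock()
	return pollPreemptionInterval.interval
}

func main() {
	setPollPreemptionInterval(10 * time.Millisecond) // e.g. a unit test shortening the poll
	_ = getPollPreemptionInterval()
}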

Fixes: #135267
Epic: none
Release note: none

@cockroach-teamcity
Member

This change is Reviewable

@DarrylWong DarrylWong marked this pull request as ready for review November 15, 2024 18:56
@DarrylWong DarrylWong requested a review from a team as a code owner November 15, 2024 18:56
@DarrylWong DarrylWong requested review from srosenberg and nameisbhaskar and removed request for a team November 15, 2024 18:56
@@ -2139,13 +2139,14 @@ func monitorForPreemptedVMs(ctx context.Context, t test.Test, c cluster.Cluster,
if c.IsLocal() || !c.Spec().UseSpotVMs {
return
}
interval := pollPreemptionInterval
Member

@srosenberg srosenberg Nov 16, 2024


I was going to mention on the previous PR that we might need a mutex. In general, when there are multiple hooks (for mocking), it might be best to stick them in a mutex-protected struct.

That aside, I'm a bit puzzled by the data race in [1]. The subtests are executed sequentially (though you'd still have a race even if t.Parallel() were enabled). The write on L754 races with the read from the previous subtest, right? That would imply that the runCtx passed into monitorForPreemptedVMs wasn't cancelled upon completion. If that's the case, we have a leaking goroutine.

[1] #135267
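
A minimal illustration of the grouping suggested above, extending the sketch from the description; the extra hook field and both names here are hypothetical, not taken from the patch:

// Illustrative only: several mock hooks guarded by one mutex, so every
// test override and every read goes through the same lock.
var testHooks = struct {
	sync.Mutex
	pollPreemptionInterval time.Duration
	preemptionCheckFn      func() bool
}{}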

Contributor Author

@DarrylWong DarrylWong Nov 16, 2024


Ah yes, I was going to leave a comment (but forgot) saying something along the lines of "I'm not really sure how this fixes the data race, because I'm not 100% sure what the data race is".

That would imply that the runCtx passed into monitorForPreemptedVMs wasn't cancelled upon completion. If that's the case, we have a leaking goroutine.

That was my initial thought too, but I commented out the first of the two tests and was still getting the data race. Even weirder, when both tests are running, the data race always happens on the second. That would suggest the race depends on both tests running, but it doesn't. To add to that, commenting out the second test causes the first to start failing with a data race.

Not really sure what to make of that 😅 I also tried adding a defer leaktest.AfterTest which didn't catch anything.

it might be best to stick them in a mutex-protected struct.

Yeah, I thought of that, but figured it would be unfortunate to have to add a mutex for a unit-test-only hook. I can go that route though if we think it's safer.

Member


We both missed an essential hint in the data race report: Goroutine 741 (finished). At the time the data race report is printed [1], the goroutine running monitorForPreemptedVMs has already exited. This confirms that the runCtx is indeed cancelled.

We can also conclude that it's the write of the second subtest that races with the read of the first subtest. The sequence of (scheduled) events is roughly as follows. The goroutine of the first subtest exits before monitorForPreemptedVMs exits; at this point, tsan has recorded a read from the corresponding goroutine id. Next, the goroutine for the second subtest executes; tsan records the write. A data race is detected. Subsequently, monitorForPreemptedVMs exits. The reporter iterates through the implicated goroutines, printing the stack traces, at which point Goroutine 741 has already finished :)

Your fix works because the read in monitorForPreemptedVMs must happen before the first subtest goroutine exits, and transitively it happens before the second subtest does the write. s.Run essentially establishes the happens-before order; from runTest,

monitorForPreemptedVMs(runCtx, t, c, l)
// This is the call to actually run the test.
s.Run(runCtx, t, c)

Thus, the only read in monitorForPreemptedVMs now happens before s.Run exits, and by extension before runTest exits.
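
To make that ordering concrete, a standalone sketch of the same happens-before pattern (names are illustrative, not the roachtest code):

package main

var hook int // stand-in for the package-level test hook

func main() {
	done := make(chan struct{})
	go func() {
		_ = hook // the read, as in monitorForPreemptedVMs
		close(done)
	}()
	// Waiting here plays the role of s.Run returning: it orders the
	// goroutine's read before anything that follows in this function.
	<-done
	hook = 1 // the "second subtest" write; ordered after the read, so no race
}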

Yeah, I thought of that but figured it would be unfortunate to have to add a mutex for a unit test only hook. I can go that route though if we think it's safer.

It might be a good idea since the reason the current fix works is rather non-trivial.

[1] https://github.com/llvm/llvm-project/blob/8c7c8eaa1933d24c1eb869ba85469908547e3677/compiler-rt/lib/tsan/rtl/tsan_report.cpp#L424

Contributor Author


Ahh, yes, I think that explanation checks out.

but I commented out the first of the two tests and was still getting the data race.

I should've taken a closer look at the data race I was getting: it was due to the defer statement, which would fail under race for the same reason. That explains why it was failing with only one subtest running.

It might be a good idea since the reason the current fix works is rather non-trivial.

Done.

@DarrylWong DarrylWong force-pushed the deflake-test-vm-preempt branch 2 times, most recently from f3dcca2 to befb39d on November 18, 2024 17:07
interval := pollPreemptionInterval.interval
// If no interval is set, default to 5 minutes.
if interval == 0 {
	interval = 5 * time.Minute
}
Member


This belongs in the init function; otherwise, it's not exactly isomorphic to the previous version; e.g., interval could be reset to 0 (by another thread).
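
A hedged sketch of that suggestion, assuming the struct shape from the fix: the default is installed once at package init, so reads never re-derive it, and a concurrent reset to 0 cannot silently reintroduce the fallback at read time.

func init() {
	pollPreemptionInterval.Lock()
	defer pollPreemptionInterval.Unlock()
	// Default to 5 minutes once, up front, keeping the behavior
	// isomorphic to the pre-mutex version.
	pollPreemptionInterval.interval = 5 * time.Minute
}

The read site could then drop the zero-check entirely.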

Member

@srosenberg srosenberg left a comment


LGTM, modulo a tiny init-related comment.

This change switches pollPreemptionInterval to a
mutex-protected struct, as multiple unit tests modify
it, which can otherwise lead to a data race.

Fixes: cockroachdb#135267
Epic: none
Release note: none
@DarrylWong DarrylWong force-pushed the deflake-test-vm-preempt branch from befb39d to 8e3466d on November 19, 2024 15:41
@DarrylWong
Contributor Author

TFTRs!

bors r=srosenberg, herkolategan
