Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: fix a race when reading server object #108371

Closed

Conversation

lidorcarmel
Copy link
Contributor

@lidorcarmel lidorcarmel commented Aug 8, 2023

Avoid returning server when it's not ready - we shouldn't read it when started is false (that's a race).

Testing: this test (sometimes) fails without this pr and succeeds with it:
./dev test pkg/ccl/streamingccl/streamingest:streamingest_test -f TestDataDriven --race -- --runs_per_test=10

Epic: none
Informs: #107930

Release note: None

@lidorcarmel lidorcarmel requested review from knz and stevendanna August 8, 2023 17:32
@blathers-crl
Copy link

blathers-crl bot commented Aug 8, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@lidorcarmel lidorcarmel added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 8, 2023
@lidorcarmel lidorcarmel marked this pull request as ready for review August 8, 2023 17:54
@lidorcarmel lidorcarmel requested review from a team as code owners August 8, 2023 17:54
Avoid returning `server` when it's not ready - we shouldn't read it when
`started` is false (that's a race).

Testing: this test (sometimes) fails without this pr and succeeds with it:
`./dev test pkg/ccl/streamingccl/streamingest:streamingest_test -f TestDataDriven --race -- --runs_per_test=10`

Epic: none
Informs: cockroachdb#107930

Release note: None
@@ -106,7 +106,10 @@ var _ serverState = (*serverStateUsingChannels)(nil)

// getServer is part of the serverState interface.
func (s *serverStateUsingChannels) getServer() (orchestratedServer, bool) {
return s.server, s.started.Get()
if s.started.Get() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems incorrect. It was already the case that the callers of this method should not consider the 1st return value if the 2nd was false.

you observing a difference means there are some callers that were not using the API properly. That is the bug we need to find and fix.

I have searched all the current callers of this, and I cannot see any one of them using the API incorrectly.

Can you share the steps you used to assert this is making a difference?

@knz
Copy link
Contributor

knz commented Aug 8, 2023

An alternate approach would be to recognize there is a race condition between the getServer method and the assignments in (*channelOrchestrator).startControlledServer.

Maybe there should be a mutex here to tie them together.

@knz
Copy link
Contributor

knz commented Aug 9, 2023

let's continue here: #108401

@lidorcarmel
Copy link
Contributor Author

I think the mutex is better than atomic here, so no need for this PR but.. quickly responding to your questions:

  • repro was using the command in the PR description (but it's a flaky test.. so maybe not super convincing)
  • unless I'm missing something, the race is when reading the server pointer, which we read even if the (atomic) bool is false, meaning maybe some other thread is currently writing to that pointer. I agree that the caller will never use that pointer but we still tried reading it which we shouldn't.

@lidorcarmel lidorcarmel closed this Aug 9, 2023
craig bot pushed a commit that referenced this pull request Aug 9, 2023
108401: server: avoid a race in the server controller r=lidorcarmel a=knz

Co-authored with `@lidorcarmel` 

Prior to this patch, the Go race detector was complaining about racy concurrent accesses to `serverStateUsingChannels.server`, via `(*serverStateUsingChannels) getServer()` and `(*channelOrchestrator) startControlledServer()`.

These racy accesses happened to be safe because the writes and reads to that field were correctly ordered around updates to an atomic bool (the `started` field). However, the race detector is not sufficiently sophisticated to detect this ordering and satisfy itself that the state transition can only happen once.

In order to silence the race detector (with no change in correctness), this patch replaces the atomic bool by a mutex, whose access semantics are properly understood by the race detector.

This change incidentally makes the code slightly easier to read and understand.

Supersedes #108371.
Fixes #107930.
Epic: CRDB-28893

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants