sql: protect WaitGroup decrement in CopyIn via sync.Once #115712

Merged
merged 1 commit into from
Dec 7, 2023

Conversation

yuzefovich
Member

@yuzefovich yuzefovich commented Dec 6, 2023

We've recently seen a "negative WaitGroup counter" server crash during COPY FROM execution a few times, but we have been unable to determine the root cause. The problem appears to occur right after the COPY execution is canceled due to statement_timeout. The synchronization setup is as follows:

  • the network-handling goroutine calls wg.Add(1), pushes the CopyIn command onto the stmt buf, and then blocks via wg.Wait()
  • the copy-handling connExecutor calls wg.Done() in the defer of execCopyIn. That defer must somehow be executed at least twice, but it's unclear to me how that can happen.
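
The handoff described in the two bullets above can be sketched as follows. This is a minimal illustration, not the CockroachDB code: `copyIn`, `execCopyIn`, `handoff`, and `doneTwice` are hypothetical stand-ins, and an unbuffered channel plays the role of the stmt buf. The `doneTwice` helper shows why a second `Done()` after a single `Add(1)` crashes the process unless the panic is recovered.

```go
package main

import (
	"fmt"
	"sync"
)

// copyIn is a hypothetical stand-in for the CopyIn command; only the
// WaitGroup handoff is modeled.
type copyIn struct {
	copyDone *sync.WaitGroup
}

// execCopyIn models the executor side: Done runs in a defer.
func execCopyIn(cmd copyIn) {
	defer cmd.copyDone.Done()
	// COPY execution would happen here.
}

// handoff models the network side: Add(1), push the command, then Wait.
func handoff(stmtBuf chan<- copyIn) {
	var wg sync.WaitGroup
	wg.Add(1)
	stmtBuf <- copyIn{copyDone: &wg}
	wg.Wait() // blocks until the executor calls Done
}

// doneTwice calls Done twice after a single Add(1) and returns the
// message of the resulting panic, or "" if nothing panicked.
func doneTwice() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	var wg sync.WaitGroup
	wg.Add(1)
	wg.Done() // counter 1 -> 0
	wg.Done() // counter 0 -> -1: panics
	return ""
}

func main() {
	stmtBuf := make(chan copyIn)
	go func() { execCopyIn(<-stmtBuf) }()
	handoff(stmtBuf)
	fmt.Println("handoff completed")
	fmt.Println("second Done:", doneTwice())
}
```

In a real server the panic is not recovered inside a helper like this, which is why the symptom is a crash rather than an error.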

Since we don't understand how this can happen and have no reproduction, this commit attempts to mitigate the problem by ensuring that wg.Done() is called exactly once. This is achieved via sync.Once.
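
As a rough sketch of that mitigation (assuming nothing about the actual patch beyond the use of `sync.Once`), the decrement can be wrapped so that repeated calls become no-ops. The names `guardedDone` and `callTwice` are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedDone returns a function that decrements wg at most once, no
// matter how many times the returned function is invoked.
func guardedDone(wg *sync.WaitGroup) func() {
	var once sync.Once
	return func() { once.Do(wg.Done) }
}

// callTwice invokes the guarded decrement twice after a single Add(1)
// and reports whether it completed without panicking.
func callTwice() (ok bool) {
	defer func() { ok = recover() == nil }()
	var wg sync.WaitGroup
	wg.Add(1)
	done := guardedDone(&wg)
	done()    // counter 1 -> 0
	done()    // no-op: sync.Once has already fired
	wg.Wait() // returns immediately because the counter is 0
	return true
}

func main() {
	fmt.Println("no panic:", callTwice())
}
```

Note that `sync.Once` only guarantees "at most once"; the "exactly once" property also relies on the decrement still being reached on every exit path, which the defer provides.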

Fixes: #112095.

Release note: None

@yuzefovich yuzefovich added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. labels Dec 6, 2023
@yuzefovich yuzefovich requested review from rafiss and michae2 December 6, 2023 18:17
@yuzefovich yuzefovich requested review from a team as code owners December 6, 2023 18:17

blathers-crl bot commented Dec 6, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich
Member Author

I'm not very happy with this change, but I don't have any other ideas on how to mitigate #112095 / https://github.com/cockroachlabs/support/issues/2741. Curious about thoughts on whether it's worth making this change or not.

@@ -3006,7 +3006,9 @@ func (ex *connExecutor) execCopyIn(
 	}()

 	// When we're done, unblock the network connection.
-	defer cmd.CopyDone.Done()
+	defer func() {
Collaborator

@rafiss rafiss Dec 6, 2023

one more idea instead of a buffered channel: could we change this to something like:

once.Do(func() { close(cmd.CopyDone) })

or

once.Do(func() { cmd.CopyDone.Done() })
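
For comparison, the first variant (closing a channel under `sync.Once`) might look like the sketch below. The type `copyDoneChan` and its methods are hypothetical names for this illustration; the point is that closing an already-closed channel panics just like a negative WaitGroup counter does, so the channel form needs the same guard.

```go
package main

import (
	"fmt"
	"sync"
)

// copyDoneChan is a hypothetical completion signal built on a channel.
type copyDoneChan struct {
	ch   chan struct{}
	once sync.Once
}

func newCopyDoneChan() *copyDoneChan {
	return &copyDoneChan{ch: make(chan struct{})}
}

// signal closes the channel on the first call; later calls are no-ops.
// Without the Once, a second close would panic.
func (c *copyDoneChan) signal() {
	c.once.Do(func() { close(c.ch) })
}

// wait blocks until signal has been called at least once.
func (c *copyDoneChan) wait() {
	<-c.ch
}

func main() {
	c := newCopyDoneChan()
	go func() {
		c.signal()
		c.signal() // safe: guarded by sync.Once
	}()
	c.wait()
	fmt.Println("network routine unblocked")
}
```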

@yuzefovich yuzefovich changed the title sql: replace WaitGroup in CopyIn with a channel sql: protect WaitGroup decrement in CopyIn via sync.Once Dec 6, 2023
Member Author

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)


pkg/sql/conn_executor.go line 3009 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

one more idea instead of a buffered channel: could we change this to something like:

once.Do(func() { close(cmd.CopyDone) })

or

once.Do(func() { cmd.CopyDone.Done() })

I like it, thanks, done.

Collaborator

@rafiss rafiss left a comment

lgtm. ty for fixing!

@@ -370,6 +370,8 @@ type CopyIn struct {
 	// CopyDone is decremented once execution finishes, signaling that control of
 	// the connection is being handed back to the network routine.
 	CopyDone *sync.WaitGroup
+	// Once is used to decrement CopyDone exactly once.
+	Once *sync.Once
Collaborator

one thought i had is if this would be more clear with an embedded struct:

	CopyDone struct {
		*sync.WaitGroup
		*sync.Once
	}

(so that it's obvious the two should be used together)

Member Author

@yuzefovich yuzefovich left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)


pkg/sql/conn_io.go line 374 at r3 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

one thought i had is if this would be more clear with an embedded struct:

	CopyDone struct {
		*sync.WaitGroup
		*sync.Once
	}

(so that it's obvious the two should be used together)

Very nice, done.
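
A self-contained sketch of how the embedded-struct shape could be used is shown below. It is an illustration under assumptions, not the merged code: `copyIn`, `newCopyIn`, and `finish` are hypothetical names. Because both pointers are embedded, `Do` (from `*sync.Once`) and `Add`, `Done`, `Wait` (from `*sync.WaitGroup`) are promoted onto `CopyDone`.

```go
package main

import (
	"fmt"
	"sync"
)

// copyIn is a hypothetical stand-in for the CopyIn command.
type copyIn struct {
	// CopyDone groups the WaitGroup with the Once that guards its
	// decrement, so the two are visibly meant to be used together.
	CopyDone struct {
		*sync.WaitGroup
		*sync.Once
	}
}

// newCopyIn allocates both embedded pointers; leaving either one nil
// would cause a nil-pointer panic on first use.
func newCopyIn() *copyIn {
	c := &copyIn{}
	c.CopyDone.WaitGroup = &sync.WaitGroup{}
	c.CopyDone.Once = &sync.Once{}
	return c
}

// finish decrements the WaitGroup at most once.
func (c *copyIn) finish() {
	c.CopyDone.Do(c.CopyDone.Done)
}

func main() {
	c := newCopyIn()
	c.CopyDone.Add(1)
	go func() {
		c.finish()
		c.finish() // no-op: the Once has already fired
	}()
	c.CopyDone.Wait()
	fmt.Println("copy finished")
}
```

One trade-off of embedding pointers is that the zero value of the struct is not usable, so every construction site has to initialize both fields.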

@craig
Contributor

craig bot commented Dec 7, 2023

Build succeeded:

@craig craig bot merged commit af1fda5 into cockroachdb:master Dec 7, 2023
8 of 9 checks passed

blathers-crl bot commented Dec 7, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.1-115712 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/115799/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.1.x failed. See errors above.


error setting reviewers, but backport branch blathers/backport-release-23.2-115712 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/115800/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich yuzefovich deleted the copy-wg branch December 7, 2023 20:30
@yuzefovich
Member Author

blathers backport release-23.2.0


blathers-crl bot commented Dec 7, 2023


error getting backport branch release-release-23.2.0: unexpected status code: 404 Not Found

Backport to branch release-23.2.0 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich
Member Author

blathers backport 23.2.0


blathers-crl bot commented Dec 7, 2023


error getting backport branch release-23.2.0: unexpected status code: 404 Not Found

Backport to branch 23.2.0 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@yuzefovich
Member Author

blathers backport 23.2.0-rc


blathers-crl bot commented Dec 7, 2023


error setting reviewers, but backport branch blathers/backport-release-23.2.0-rc-115712 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/115829/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.2.0-rc failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Development

Successfully merging this pull request may close these issues.

sql: v23.1.11: panic in execCopyIn
3 participants