-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: protect WaitGroup decrement in CopyIn via sync.Once #115712
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
I'm not very happy with this change, but I don't have any other ideas on how to mitigate #112095 / https://github.com/cockroachlabs/support/issues/2741. Curious about thoughts on whether it's worth making this change or not. |
@@ -3006,7 +3006,9 @@ func (ex *connExecutor) execCopyIn( | |||
}() | |||
|
|||
// When we're done, unblock the network connection. | |||
defer cmd.CopyDone.Done() | |||
defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more idea instead of a buffered channel: could we change this to something like:
once.Do(func() { close(cmd.CopyDone) })
or
once.Do(func() { cmd.CopyDone.Done() })
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)
pkg/sql/conn_executor.go
line 3009 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
one more idea instead of a buffered channel: could we change this to something like:
once.Do(func() { close(cmd.CopyDone) })
or
once.Do(func() { cmd.CopyDone.Done() })
I like it, thanks, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. ty for fixing!
pkg/sql/conn_io.go
Outdated
@@ -370,6 +370,8 @@ type CopyIn struct { | |||
// CopyDone is decremented once execution finishes, signaling that control of | |||
// the connection is being handed back to the network routine. | |||
CopyDone *sync.WaitGroup | |||
// Once is used to decrement CopyDone exactly once. | |||
Once *sync.Once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one thought i had is if this would be more clear with an embedded struct:
CopyDone struct {
*sync.WaitGroup
*sync.Once
}
(so that it's obvious the two should be used together
We've recently seen "negative WaitGroup counter" server crash during COPY FROM execution a few times, but we have been unable to understand the root cause. It appears that the problem can happen right after the COPY execution is canceled due to `statement_timeout`. The synchronization setup is the following: - the network-handling goroutine calls `wg.Add(1)`, pushes CopyIn command onto the stmt buf, and then blocks via `wg.Wait()` - the copy-handling connExecutor calls `wg.Done()` in the defer of `execCopyIn`. It must be the case that that defer is executed at least twice, but it's unclear to me how that can happen. In the absence of understanding of how this can happen and with no reproduction, this commit attempts to mitigate the problem by ensuring that `wg.Done()` is called exactly once. This is achieved via `sync.Once`. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rafiss)
pkg/sql/conn_io.go
line 374 at r3 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
one thought i had is if this would be more clear with an embedded struct:
CopyDone struct { *sync.WaitGroup *sync.Once }
(so that it's obvious the two should be used together
Very nice, done.
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error setting reviewers, but backport branch blathers/backport-release-23.1-115712 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/115799/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.1.x failed. See errors above. error setting reviewers, but backport branch blathers/backport-release-23.2-115712 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/115800/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
blathers backport release-23.2.0 |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error getting backport branch release-release-23.2.0: unexpected status code: 404 Not Found Backport to branch release-23.2.0 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
blathers backport 23.2.0 |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error getting backport branch release-23.2.0: unexpected status code: 404 Not Found Backport to branch 23.2.0 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
blathers backport 23.2.0-rc |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error setting reviewers, but backport branch blathers/backport-release-23.2.0-rc-115712 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/115829/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. [] Backport to branch 23.2.0-rc failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
We've recently seen "negative WaitGroup counter" server crash during COPY FROM execution a few times, but we have been unable to understand the root cause. It appears that the problem can happen right after the COPY execution is canceled due to
statement_timeout
. The synchronization setup is the following:wg.Add(1)
, pushes CopyIn command onto the stmt buf, and then blocks viawg.Wait()
wg.Done()
in the defer ofexecCopyIn
. It must be the case that that defer is executed at least twice, but it's unclear to me how that can happen.In the absence of understanding of how this can happen and with no reproduction, this commit attempts to mitigate the problem by ensuring that
wg.Done()
is called exactly once. This is achieved viasync.Once
.Fixes: #112095.
Release note: None