Commit
sql: replace WaitGroup in CopyIn with a channel
We've recently seen a "negative WaitGroup counter" server crash during COPY FROM execution a few times, but we have been unable to understand the root cause. It appears that the problem can happen right after the COPY execution is canceled due to `statement_timeout`.

The synchronization setup is the following:
- the network-handling goroutine calls `wg.Add(1)`, pushes the CopyIn command onto the stmt buf, and then blocks via `wg.Wait()`
- the copy-handling connExecutor calls `wg.Done()` in the defer of `execCopyIn`.

It must be the case that this defer is executed at least twice, but it's unclear to me how that can happen.

In the absence of an understanding of how this can happen, and with no reproduction, this commit attempts to mitigate the problem by switching from the wait group to a channel. In particular, now:
- the network-handling goroutine will block until something is sent on the channel
- the copy-handling connExecutor goroutine will send on the channel in the defer in `execCopyIn`.

The channel is buffered, so up to 4 sends on the channel are allowed even though the network-handling goroutine will be unblocked on the very first one.

The risk of this change is that with multiple sends on the channel we enter undefined territory. In particular, the contract is such that `execCopyIn` takes over the connection, and now it seems possible that the network-handling goroutine wakes up after the first send while the copy-handling connExecutor doesn't exit, so the latter could continue reading from the connection. In other words, we replace a server crash (which is very bad) with undefined behavior (which could be very bad too).

Release note: None
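The difference between the two synchronization schemes described above can be illustrated with a minimal standalone sketch. It is not the code from this commit: the function names `doneTwiceWaitGroup` and `doneTwiceChannel` are hypothetical, and the "defer runs twice" condition is simulated directly, since the real cause is unknown. The only detail taken from the commit message is the buffer size of 4.

```go
package main

import (
	"fmt"
	"sync"
)

// doneTwiceWaitGroup (hypothetical name) simulates the old scheme: one Add(1)
// followed by two Done() calls. It recovers the resulting panic and returns
// its message so the behavior can be inspected without crashing the program.
func doneTwiceWaitGroup() (panicMsg string) {
	defer func() {
		if r := recover(); r != nil {
			panicMsg = fmt.Sprint(r)
		}
	}()
	var wg sync.WaitGroup
	wg.Add(1)
	wg.Done() // the expected signal from the defer
	wg.Done() // the unexplained second signal: the counter goes negative
	return ""
}

// doneTwiceChannel (hypothetical name) simulates the new scheme: the signal is
// a send on a buffered channel. It returns how many sends are left unread
// after the waiter has been unblocked once.
func doneTwiceChannel() (leftover int) {
	// Capacity 4 matches the commit message; extra sends up to the
	// capacity do not block and do not panic.
	done := make(chan struct{}, 4)
	signal := func() { done <- struct{}{} }

	signal() // the expected signal
	signal() // the unexplained second signal is absorbed by the buffer

	<-done // the waiter is unblocked by the very first send
	return len(done)
}

func main() {
	fmt.Println("wait group:", doneTwiceWaitGroup())
	fmt.Println("channel, unread sends:", doneTwiceChannel())
}
```

In the server the wait group panic is not recovered, which is what turns the extra `Done()` into a crash. The channel version tolerates the extra signal, but as noted above it does nothing to stop the second sender from still being active on the connection.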