sql: fix a race with finishing a span and executing RPCs #95482
Conversation
This commit fixes a race that was introduced when we fully parallelized the execution of the setup of the remote flows. In particular, it became possible for a `setup-flow-async` "runner" tracing span to be finished before a concurrent `SetupFlow` RPC was issued, even though that RPC uses the span as the parent for its own RPC span. The issue is only present when we didn't spin up a "listener" goroutine for whatever reason (most likely because the server is quiescing), in which case we didn't block to wait for the concurrent RPCs to be evaluated. We canceled the context of those RPCs, but it appears that the context cancellation is not immediately noticed by the RPC call (or at least it is noticed _after_ the gRPC interceptor code attempts to create a new tracing span). This is now fixed by blocking the cleanup until all concurrently-issued RPCs have completed.

Release note: None
On one of the logic tests, I reproduced the problem in 4 minutes before this patch and got no failures with the patch in 55 minutes.
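For illustration, here is a minimal sketch of the shape of the fix: the cleanup now drains one result per concurrently-issued RPC before the "runner" tracing span is finished. The names (`resultChan`, `numAsyncRequests`, `numSerialRequests`) mirror those discussed in the review below, but the code is simplified and is not the actual implementation in `pkg/sql/distsql_running.go`.

```go
// waitForSetupFlowRPCs is a simplified sketch of the blocking cleanup.
func waitForSetupFlowRPCs(resultChan <-chan error, numAsyncRequests, numSerialRequests int) error {
	var firstErr error
	// Every issued SetupFlow RPC sends exactly one result on resultChan, so
	// draining that many results guarantees that no RPC is still in flight
	// when the parent "runner" tracing span is finished.
	for i := 0; i < numAsyncRequests+numSerialRequests; i++ {
		if err := <-resultChan; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```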
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/distsql_running.go
line 519 at r1 (raw file):
```go
	// send on the result channel.
	for i := 0; i < numAsyncRequests+numSerialRequests; i++ {
		<-resultChan
	}
```
If `runnerRequest.run` encounters a fatal error, might a send to the channel get missed, causing us to block indefinitely here? Perhaps using `defer` in `runnerRequest.run` could ensure we always send on the channel.
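A minimal sketch of that suggestion, with `runnerRequest` and its fields reduced to illustrative stand-ins rather than the real types:

```go
// runnerRequest is an illustrative, simplified stand-in for the real type
// in pkg/sql/distsql_running.go.
type runnerRequest struct {
	resultChan chan<- error
	setupFlow  func() error // stands in for the actual SetupFlow RPC call
}

func (req runnerRequest) run() {
	var err error
	// Sending from a defer guarantees exactly one send per request (barring
	// a panic), so the goroutine draining resultChan can never be left
	// waiting because an early-return error path skipped the send.
	defer func() {
		req.resultChan <- err
	}()
	err = req.setupFlow()
}
```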
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @msirek)
pkg/sql/distsql_running.go
line 519 at r1 (raw file):
Previously, msirek (Mark Sirek) wrote…
If `runnerRequest.run` encounters a fatal error, might a send to the channel get missed, causing us to block indefinitely here? Perhaps using `defer` in `runnerRequest.run` could ensure we always send on the channel.
If that runner goroutine encounters a fatal crash, then the whole process will crash, and it doesn't matter whether we'd block here or not.
I do see that `defer`ring a send on the result channel in `runnerRequest.run` would make that code slightly easier to follow (i.e. it'd be clearer that under all circumstances (modulo a panic) exactly one object is sent on the results channel), so I could make the corresponding change if you think it's worth it. It would result in an additional function allocation, but that's probably not that important, and personally I'm ok with either option.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/distsql_running.go
line 519 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
If that runner goroutine encounters a fatal crash, then the whole process will crash, and it doesn't matter whether we'd block here or not.
I do see that `defer`ring a send on the result channel in `runnerRequest.run` would make that code slightly easier to follow (i.e. it'd be clearer that under all circumstances (modulo a panic) exactly one object is sent on the results channel), so I could make the corresponding change if you think it's worth it. It would result in an additional function allocation, but that's probably not that important, and personally I'm ok with either option.
Oh, if the whole process crashes (when there's a panic), then maybe there's no point in using `defer`. Thanks for clarifying. Maybe there is potential for future improvement by adding a panic catcher.
TFTR!
bors r+
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @msirek)
pkg/sql/distsql_running.go
line 519 at r1 (raw file):
Previously, msirek (Mark Sirek) wrote…
Oh, if the whole process crashes (when there's a panic), then maybe there's no point in using `defer`. Thanks for clarifying. Maybe there is potential for future improvement by adding a panic catcher.
In this case a panic-catcher doesn't seem worth it - `runnerRequest.run` seems so simple that if we do hit a panic for some reason I can't really imagine (without any code modifications), then something has gone truly wrong, and it might actually be better to crash. FWIW this code hasn't really been changed in over 5 years, so it should be pretty safe :)
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
pkg/sql/distsql_running.go
line 519 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
In this case a panic-catcher doesn't seem worth it - `runnerRequest.run` seems so simple that if we do hit a panic for some reason I can't really imagine (without any code modifications), then something has gone truly wrong, and it might actually be better to crash. FWIW this code hasn't really been changed in over 5 years, so it should be pretty safe :)
OK, thanks
It appears that several recent bors runs failed due to this flake, bumping the priority.

bors r+ p=999
Already running a review
Build succeeded:
but please go through my comments
Reviewable status: complete! 2 of 0 LGTMs obtained
pkg/sql/distsql_running.go
line 469 at r1 (raw file):
```go
	// numAsyncRequests tracks the number of the SetupFlow RPCs that were
	// delegated to the DistSQL runner goroutines.
	var numAsyncRequests int
```
We only seem to use the sum of these new variables; do we need to introduce them?
Particularly since one of the code paths seems to refer to the sum as `len(flows)-1`.

I would also consider trying to use a higher-level structure for managing the fanout, like an error group (perhaps by replacing `runnerRequest.resultChan` with a callback). That would remove the need to keep the producer and consumers in sync with respect to the number of requests, and would probably also make the error deduplication code nicer.
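For illustration, a sketch of what an errgroup-based fanout with a callback might look like; `setupFlowOnNode` is a hypothetical callback standing in for the `SetupFlow` RPC, and this shows the reviewer's suggestion rather than what the PR ended up doing:

```go
import (
	"context"

	"golang.org/x/sync/errgroup"
)

// setupRemoteFlows issues one SetupFlow call per remote node via an errgroup.
func setupRemoteFlows(
	ctx context.Context, nodeIDs []int, setupFlowOnNode func(context.Context, int) error,
) error {
	g, gCtx := errgroup.WithContext(ctx)
	for _, nodeID := range nodeIDs {
		nodeID := nodeID // capture the loop variable
		g.Go(func() error {
			return setupFlowOnNode(gCtx, nodeID)
		})
	}
	// Wait blocks until every RPC has returned, so there is no need to
	// count async vs. serial requests, and the first error is returned.
	return g.Wait()
}
```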
pkg/sql/distsql_running.go
line 519 at r1 (raw file):
If that runner goroutine encounters a fatal crash, then the whole process will crash, and it doesn't matter whether we'd block here or not.
Not in the synchronous call case. In that case, if there were such a panic, I think we would end up deadlocking on the unwind path (right?).
It's not a huge deal in my book, but since that function already uses a defer, it'd be good to also deal with the channel in a defer. I don't think it'd be an extra allocation (or at least I hope it wouldn't); I believe most allocations from `defer` went away many compiler versions ago.
This discussion would be moot with an error group.
Maybe there is potential for future improvement by adding a panic catcher.
In this case a panic-catcher doesn't seem worth it - runnerRequest.run seems so simple that if we do hit a panic for some reason I can't really imagine (without any code modifications), then something has gone truly wrong, and it might actually be better to crash.
This is always the case, in my opinion. Recovering from panics (i.e. recovering without re-panicking) is a fool's game.
pkg/sql/distsql_running.go
line 557 at r1 (raw file):
```go
		}
	}
	if buildutil.CrdbTestBuild {
```
Why is this a test-only assertion? The check is cheap (particularly if you make copies of the printed variables inside the check block to prevent the originals from escaping).
In my opinion, making checks test-only should only be done when there's a specific reason to do so, because the extra `if` for the test check is generally not worth having. But in this specific case, in addition, if the assertion fails, then we'll probably deadlock something, which is probably worse than the crash you're trying to avoid.
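A sketch of the unconditional check being suggested, with illustrative names (`checkAllRPCsIssued` and its parameters are hypothetical); `errors.AssertionFailedf` is from `github.com/cockroachdb/errors`:

```go
import "github.com/cockroachdb/errors"

// checkAllRPCsIssued shows the assertion made unconditional rather than
// gated on buildutil.CrdbTestBuild. The counts only escape (to format the
// error) when the assertion actually fires.
func checkAllRPCsIssued(numAsyncRequests, numSerialRequests, numFlows int) error {
	if issued := numAsyncRequests + numSerialRequests; issued != numFlows-1 {
		return errors.AssertionFailedf(
			"issued %d SetupFlow RPCs, expected %d", issued, numFlows-1)
	}
	return nil
}
```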
Reviewable status: complete! 2 of 0 LGTMs obtained
pkg/sql/distsql_running.go
line 469 at r1 (raw file):
We need two pieces of information: the total number of RPCs issued (which can be smaller than `len(flows)-1` when a synchronous RPC call results in an error), and whether we issued at least one RPC asynchronously via a DistSQL worker (in which case we must spin up a new "listener" goroutine). This can be achieved by having two `int`s or an `int` and a `bool`. Perhaps the latter option is cleaner; I'll open up a PR.
Particularly since one of the code paths seems to refer to the sum as `len(flows)-1`.

That code path is only reached if no error was encountered on a synchronous RPC call, in which case the sum of these variables always equals `len(flows)-1`.
I would also consider trying to use a higher-level structure for managing the fanout, like an error group
IIUC, `errgroup.Group` would spin up new goroutines on demand in order to issue the RPCs asynchronously. However, this is not the behavior that we want (or at least currently expect) - we already have a pool of DistSQL worker goroutines (4x the number of vCPUs by default) that are shared and reused across all queries on the node, and it'd likely be worse to always spin up new goroutines just to issue a single RPC. Am I missing something about the error group? Let's continue this discussion on the new PR I'm about to open.
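To contrast with the errgroup approach, here is a rough sketch of the shared-worker-pool pattern described above, reusing the simplified `runnerRequest` from the earlier sketch; the names and the fallback behavior are illustrative, not the actual implementation:

```go
import "context"

// startRunners starts a fixed pool of long-lived worker goroutines that
// drain a shared request channel until the stopper context is canceled.
func startRunners(ctx context.Context, numWorkers int, requestChan <-chan runnerRequest) {
	for i := 0; i < numWorkers; i++ {
		go func() {
			for {
				select {
				case req := <-requestChan:
					req.run()
				case <-ctx.Done():
					return
				}
			}
		}()
	}
}

// issueRequest hands the request to an idle worker if one is available and
// otherwise runs it synchronously on the caller's goroutine, so no new
// goroutine is ever created just to issue a single RPC.
func issueRequest(req runnerRequest, requestChan chan<- runnerRequest) (async bool) {
	select {
	case requestChan <- req:
		return true
	default:
		req.run()
		return false
	}
}
```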
pkg/sql/distsql_running.go
line 519 at r1 (raw file):
Not in the synchronous call case. In that case, if there were such a panic, I think we would end up deadlocking on the unwind path (right?).
Yeah, you're right. I'll adjust it to use the defer.
Recovering from panics (i.e. recovering without re-panicking) is a fool's game.
Unless you're using a panic-catching mechanism for error propagation (which we do in the vectorized execution engine and in the optimizer). It's a separate discussion whether such a pattern is a good one, but I believe the SQL Queries team likes it (e.g. in the vectorized engine it significantly improved the stability of the system by converting crashes into internal errors).
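For context, a generic sketch of the panic-based error propagation pattern being referred to; the actual helpers in the vectorized engine and the optimizer differ, so treat the names here as illustrative:

```go
// internalError wraps an "expected" error thrown as a panic.
type internalError struct{ err error }

// throwError propagates err up the call stack as a panic.
func throwError(err error) {
	panic(internalError{err: err})
}

// catchError runs fn and converts a thrown internalError back into an
// ordinary error at the boundary, re-panicking on anything else so that
// true bugs still crash the process.
func catchError(fn func()) (retErr error) {
	defer func() {
		if r := recover(); r != nil {
			if ie, ok := r.(internalError); ok {
				retErr = ie.err
				return
			}
			panic(r)
		}
	}()
	fn()
	return nil
}
```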
pkg/sql/distsql_running.go
line 557 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Why is this a test-only assertion? The check is cheap (particularly if you make copies of the printed variables inside the check block to prevent the originals from escaping).

In my opinion, making checks test-only should only be done when there's a specific reason to do so, because the extra `if` for the test check is generally not worth having. But in this specific case, in addition, if the assertion fails, then we'll probably deadlock something, which is probably worse than the crash you're trying to avoid.
I'll just remove this assertion since it's probably not that useful. I don't believe we'd deadlock if this assertion fired, though - quite the opposite: it could prevent a deadlock, or at least a leaked goroutine (if we issued fewer requests than `len(flows)-1`, then the listener goroutine would never exit).
This commit fixes a race that was introduced when we fully parallelized the execution of the setup of the remote flows. In particular, it became possible for a `setup-flow-async` "runner" tracing span to be finished before a concurrent `SetupFlow` RPC was issued, even though that RPC uses the span as the parent for its own RPC span. The issue is only present when we didn't spin up a "listener" goroutine for whatever reason (most likely because the server is quiescing), in which case we didn't block to wait for the concurrent RPCs to be evaluated. We canceled the context of those RPCs, but it appears that the context cancellation is not immediately noticed by the RPC call (or at least it is noticed after the gRPC interceptor code attempts to create a new tracing span). This is now fixed by blocking the cleanup until all concurrently-issued RPCs have completed.

Fixes: #92809.

Release note: None