
sql: fix a race with finishing a span and executing RPCs #95482

Merged 1 commit into cockroachdb:master on Jan 19, 2023

Conversation

yuzefovich (Member)

This commit fixes a race that was introduced when we fully parallelized the execution of the setup of the remote flows. In particular, it became possible for a `setup-flow-async` "runner" tracing span to be finished before a concurrent `SetupFlow` RPC is issued, which uses that span as the parent for the RPC span. The issue is only present when we didn't spin up a "listener" goroutine for whatever reason (most likely because the server is quiescing), in which case we didn't block to wait for the concurrent RPCs to be evaluated. We canceled the context of those RPCs, but it appears that the context cancellation is not immediately noticed by the RPC call (or at least it is noticed after the gRPC interceptor code attempts to create a new tracing span). This is now fixed by blocking the cleanup until all concurrently-issued RPCs have completed.

Fixes: #92809.

Release note: None
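To make the fix concrete, here is a minimal, hedged Go sketch (the span type, `setupFlow`, and `resultChan` are simplified stand-ins for illustration, not the actual CockroachDB code): the cleanup path drains one result per issued RPC before finishing the parent span, rather than relying on context cancellation alone.

```go
package main

import (
	"context"
	"fmt"
)

// span stands in for a tracing span; Finish must not race with the
// creation of child spans by in-flight RPCs.
type span struct{ name string }

func (s *span) Finish() { fmt.Println("finished span:", s.name) }

// setupFlow stands in for issuing a SetupFlow RPC: a child span is derived
// from the parent regardless of whether ctx has already been canceled,
// which is exactly the race described above.
func setupFlow(ctx context.Context, parent *span, resultChan chan<- error) {
	child := &span{name: parent.name + "/setup-flow-rpc"}
	child.Finish()
	resultChan <- ctx.Err() // exactly one send per issued RPC
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	parent := &span{name: "setup-flow-async"}
	resultChan := make(chan error, 4)

	numIssued := 0
	for i := 0; i < 4; i++ {
		numIssued++
		go setupFlow(ctx, parent, resultChan)
	}

	// The fix: even when no "listener" goroutine was spun up, cancel the
	// in-flight RPCs and then block until every one of them has reported
	// back; only then is it safe to finish the parent span.
	cancel()
	for i := 0; i < numIssued; i++ {
		<-resultChan
	}
	parent.Finish()
}
```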

@yuzefovich yuzefovich requested review from andreimatei, msirek and a team January 19, 2023 01:39

yuzefovich (Member, Author)

On one of the logic tests, I reproduced the problem in 4 minutes before this patch and got no failures with the patch in 55 minutes.

msirek (Contributor) left a comment

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)


pkg/sql/distsql_running.go line 519 at r1 (raw file):

			// send on the result channel.
			for i := 0; i < numAsyncRequests+numSerialRequests; i++ {
				<-resultChan

If runnerRequest.run encounters a fatal error, might a send to the channel get missed causing us to block indefinitely here? Perhaps using defer in runnerRequest.run could ensure we always send on the channel.
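If it helps to visualize the suggestion, here is a small, hypothetical sketch (the runnerRequest shape and its fields are invented for illustration; the real struct differs): a deferred send guarantees exactly one result on every return path.

```go
package sql

import "context"

// runnerRequest is a simplified, hypothetical stand-in for the real struct.
type runnerRequest struct {
	resultChan chan<- error
}

// run reports exactly one result on every return path because the send is
// deferred, so an early return cannot skip it.
func (req runnerRequest) run(ctx context.Context) {
	var err error
	defer func() { req.resultChan <- err }()

	if err = ctx.Err(); err != nil {
		return // the deferred send still fires
	}
	// ... issue the SetupFlow RPC here and assign its error to err ...
}
```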

yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @msirek)


pkg/sql/distsql_running.go line 519 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

If runnerRequest.run encounters a fatal error, might a send to the channel get missed causing us to block indefinitely here? Perhaps using defer in runnerRequest.run could ensure we always send on the channel.

If that runner goroutine encounters a fatal crash, then the whole process will crash, and it doesn't matter whether we'd block here or not.

I do see that deferring a send on the result channel in runnerRequest.run would make that code slightly easier to follow (i.e. it'd be clearer that under all circumstances (modulo a panic) exactly one object is sent on the result channel), so I could make the corresponding change if you think it's worth it. It would add an extra function allocation, but that's probably not important, and personally I'm ok with either option.

msirek (Contributor) left a comment

:lgtm:

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)


pkg/sql/distsql_running.go line 519 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

If that runner goroutine encounters a fatal crash, then the whole process will crash, and it doesn't matter whether we'd block here or not.

I do see that deferring a send on the result channel in runnerRequest.run would make that code slightly easier to follow (i.e. it'd be clearer that under all circumstances (modulo a panic) exactly one object is sent on the result channel), so I could make the corresponding change if you think it's worth it. It would add an extra function allocation, but that's probably not important, and personally I'm ok with either option.

Oh, if the whole process crashes (when there's a panic), then maybe there's no point in using defer. Thanks for clarifying. Maybe there is potential for a future improvement by adding a panic catcher.

yuzefovich (Member, Author) left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @msirek)


pkg/sql/distsql_running.go line 519 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

Oh, if the whole process crashes (when there's a panic), then maybe there's no point in using defer. Thanks for clarifying. Maybe there is potential for a future improvement by adding a panic catcher.

In this case a panic-catcher doesn't seem worth it - runnerRequest.run seems so simple that if we do hit a panic for some reason I can't really imagine (without any code modifications), then something has gone truly wrong, and it might actually be better to crash. FWIW this code hasn't really been changed in over 5 years, so it should be pretty safe :)

msirek (Contributor) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)


pkg/sql/distsql_running.go line 519 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

In this case a panic-catcher doesn't seem worth it - runnerRequest.run seems so simple that if we do hit a panic for some reason I can't really imagine (without any code modifications), then something has gone truly wrong, and it might actually be better to crash. FWIW this code hasn't really been changed in over 5 years, so it should be pretty safe :)

OK, thanks

yuzefovich (Member, Author)

It appears that several recent bors runs failed due to this flake, bumping the priority.

bors r+ p=999

craig bot (Contributor) commented Jan 19, 2023

Already running a review

craig bot (Contributor) commented Jan 19, 2023

Build succeeded.

@craig craig bot merged commit b6434c9 into cockroachdb:master Jan 19, 2023
@yuzefovich yuzefovich deleted the setup-flow-span branch January 19, 2023 06:57
andreimatei (Contributor) left a comment

:lgtm: but please go through my comments

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained


pkg/sql/distsql_running.go line 469 at r1 (raw file):

	// numAsyncRequests tracks the number of the SetupFlow RPCs that were
	// delegated to the DistSQL runner goroutines.
	var numAsyncRequests int

We only seem to use the sum of these new variables; do we need to introduce them?
Particularly since one of the code paths seems to refer to the sum as len(flows)-1.

I would also consider trying to use a higher-level structure for managing the fanout, like an error group (perhaps by replacing runnerRequest.resultChan with a callback). That would remove the need to keep the producer and consumers in sync with respect to the number of requests, and would probably also make the error deduplication code nicer.
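As a rough illustration of the errgroup idea (not the actual CockroachDB code; the RPC signature and node list are placeholders), an errgroup.Group keeps the producer and the consumers in sync automatically and returns the first error from Wait:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// setupFlowRPC stands in for the per-node SetupFlow call; the signature is
// a placeholder for illustration.
func setupFlowRPC(ctx context.Context, node int) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

func main() {
	g, ctx := errgroup.WithContext(context.Background())
	nodes := []int{2, 3, 4}

	// One goroutine per remote flow; Wait blocks until all of them are done
	// and returns the first non-nil error, so there is no separate channel
	// to keep in sync with the number of issued requests.
	for _, node := range nodes {
		node := node // capture the loop variable (pre-Go 1.22 semantics)
		g.Go(func() error {
			return setupFlowRPC(ctx, node)
		})
	}
	if err := g.Wait(); err != nil {
		fmt.Println("setup failed:", err)
	}
}
```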


pkg/sql/distsql_running.go line 519 at r1 (raw file):

If that runner goroutine encounters a fatal crash, then the whole process will crash, and it doesn't matter whether we'd block here or not.

Not in the synchronous call case. In that case, if there were such a panic, I think we would end up deadlocking on the unwind path (right?).
It's not a huge deal in my book, but since that function already uses a defer, it'd be good to also deal with the channel in a defer. I don't think it'd be an extra allocation (or at least I hope it wouldn't); I think most allocations from deferred calls went away many compiler versions ago.
This discussion would be moot with an errorGroup.

Maybe there is potential for future improvement by adding a panic catcher.

In this case a panic-catcher doesn't seem worth it - runnerRequest.run seems so simple that if we do hit a panic for some reason I can't really imagine (without any code modifications), then something has gone truly wrong, and it might actually be better to crash.

This is always the case, in my opinion. Recovering from panics (i.e. recovering without re-panicking) is a fool's game.


pkg/sql/distsql_running.go line 557 at r1 (raw file):

		}
	}
	if buildutil.CrdbTestBuild {

Why is this a test-only assertion? The check is cheap (particularly if you make copies of the printed variables inside the check block to prevent the originals from escaping).
In my opinion, checks should be made test-only only when there's a specific reason to do so, because the extra if guarding a test-only check is generally not worth having. But in this specific case, in addition, if the assertion fails, then we'll probably deadlock somewhere, which is probably worse than the crash you're trying to avoid.
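For reference, a generic sketch of the kind of test-build-gated assertion being debated (the constant and function names here are made up; buildutil.CrdbTestBuild is the real gate in the actual code):

```go
package main

import "fmt"

// crdbTestBuild stands in for buildutil.CrdbTestBuild: true in test builds,
// false in production builds.
const crdbTestBuild = true

// checkAllRequestsIssued is a hypothetical assertion over the invariant
// discussed above: the number of issued RPCs must match what the waiting
// code expects, or the listener could wait forever.
func checkAllRequestsIssued(numIssued, expected int) {
	// Test-only flavor: free in production builds, but a violated invariant
	// there surfaces as a hang rather than a crash.
	if crdbTestBuild && numIssued != expected {
		panic(fmt.Sprintf("issued %d requests, expected %d", numIssued, expected))
	}
}

func main() {
	checkAllRequestsIssued(3, 3) // passes; a mismatch would panic in test builds
}
```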

yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained


pkg/sql/distsql_running.go line 469 at r1 (raw file):
We need two pieces of information: the total number of RPCs issued (which can be smaller than len(flows)-1 when a synchronous RPC call results in an error), and whether we issued at least one RPC asynchronously via a DistSQL worker (in which case we must spin up a new "listener" goroutine). This can be achieved by having two ints or an int and a bool. Perhaps the latter option is cleaner; I'll open up a PR.

Particularly since one of the code paths seems to refer to the sum as len(flows)-1.

That code path is only reached if no error was encountered on a synchronous RPC call, in which case the sum of these variables always equals len(flows)-1.

I would also consider trying to use a higher-level structure for managing the fanout, like an error group

IIUC, errgroup.Group would spin up new goroutines on demand in order to issue the RPCs asynchronously. However, this is not the behavior that we want (or at least currently expect): we already have a pool of DistSQL worker goroutines (4 x vCPUs of them by default) that are shared and reused across all queries on the node, and it'd likely be worse to always spin up new goroutines just to issue a single RPC. Am I missing something about the error group? Let's continue this discussion on the new PR I'm about to open.
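A tiny sketch of that bookkeeping (the struct and field names are hypothetical, not what the follow-up PR ended up using):

```go
package sql

// setupFlowTracker is a hypothetical illustration of the two pieces of
// information described above.
type setupFlowTracker struct {
	// numIssued counts every SetupFlow RPC actually issued; it can be smaller
	// than len(flows)-1 if a synchronous call returned an error early.
	numIssued int
	// issuedAsync records whether at least one RPC was handed to a DistSQL
	// runner goroutine, in which case a "listener" goroutine must be spun up.
	issuedAsync bool
}
```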


pkg/sql/distsql_running.go line 519 at r1 (raw file):

Not in the synchronous call case. In that case, if there were such a panic, I think we would end up deadlocking on the unwind path (right?).

Yeah, you're right. I'll adjust it to use the defer.

Recovering from panics (i.e. recovering without re-panicking) is a fool's game.

Unless you're using a panic-catching mechanism for error propagation (which we do in the vectorized execution engine and in the optimizer). It's a separate discussion whether such a pattern is a good one, but I believe the SQL Queries team likes it (e.g. in the vectorized engine it significantly improved the stability of the system by converting crashes into internal errors).
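For readers unfamiliar with that pattern, here is a generic sketch (this is not the vectorized engine's actual helper; the names are made up): expected errors thrown as panics are recovered and converted back into returned errors, while unexpected panics are re-thrown so genuine bugs still crash.

```go
package main

import (
	"errors"
	"fmt"
)

// internalError marks panics that are really just error propagation.
type internalError struct{ err error }

// throw propagates err up the stack as a panic.
func throw(err error) { panic(internalError{err: err}) }

// catchInternalError runs fn and converts a thrown internalError back into a
// returned error; any other panic is re-panicked so genuine bugs still crash.
func catchInternalError(fn func()) (retErr error) {
	defer func() {
		if r := recover(); r != nil {
			if ie, ok := r.(internalError); ok {
				retErr = ie.err
				return
			}
			panic(r)
		}
	}()
	fn()
	return nil
}

func main() {
	err := catchInternalError(func() {
		throw(errors.New("out of memory while building a batch"))
	})
	fmt.Println(err) // out of memory while building a batch
}
```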


pkg/sql/distsql_running.go line 557 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Why is this a test-only assertion? The check is cheap (particularly if you make copies of the printed variables inside the check block to prevent the originals from escaping).
In my opinion, checks should be made test-only only when there's a specific reason to do so, because the extra if guarding a test-only check is generally not worth having. But in this specific case, in addition, if the assertion fails, then we'll probably deadlock somewhere, which is probably worse than the crash you're trying to avoid.

I'll just remove this assertion since it's probably not that useful. I don't believe we'd deadlock if this assertion fired, though; quite the opposite - it could prevent a deadlock, or at least a leaked goroutine (if we issued fewer requests than len(flows)-1, the listener goroutine would never exit).

Successfully merging this pull request may close these issues.

logictest: setup-flow-async span is used after finish