
sql: disable pausable portals for all statements with mutations #108398

Merged

Conversation

rharding6373
Collaborator

Previously we examined the AST to determine whether a statement could be executed in a pausable portal or not. However, this was insufficient to identify volatile UDFs that could also contain mutations.

This PR revokes a portal's pausability if the statement's plan contains a mutation. That is, if the opt builder determines that any operator is a mutation operator (see IsMutationOp).

Although there is some overlap between this PR and the existing IsAllowedToPause, this PR leaves the latter in place, since it restricts some statements that are not considered mutation operators, e.g., import operators.
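The core check this PR describes can be sketched in isolation. The sketch below is a minimal Go illustration, where `planFlags`, `planFlagContainsMutation`, `planFlagIsDDL`, and `revokePausability` are hypothetical names standing in for the planner state, not CockroachDB's actual API:

```go
package main

import "fmt"

// planFlags is a hypothetical bitset mirroring how a planner could record
// properties discovered while building the plan. The names below are
// illustrative stand-ins, not CockroachDB's actual API.
type planFlags uint32

const (
	planFlagContainsMutation planFlags = 1 << iota
	planFlagIsDDL
)

func (f planFlags) isSet(flag planFlags) bool { return f&flag != 0 }

// revokePausability reports whether a portal should fall back to normal
// (un-pausable) execution: any mutation or DDL operator in the plan
// disqualifies it from pausing.
func revokePausability(flags planFlags) bool {
	return flags.isSet(planFlagContainsMutation) || flags.isSet(planFlagIsDDL)
}

func main() {
	fmt.Println(revokePausability(planFlagContainsMutation)) // true
	fmt.Println(revokePausability(0))                        // false
}
```

The point of checking plan flags rather than the AST is that the flags are set after the optimizer expands UDF bodies, so a mutation hidden inside a volatile UDF is caught too.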

Epic: None
Fixes: #107130

Release note: None

@rharding6373 rharding6373 requested review from a team as code owners August 8, 2023 23:07
@cockroach-teamcity
Member

This change is Reviewable

@rharding6373
Collaborator Author

This is an alternative approach to #107998, based on review feedback, that isn't quite as heavy-handed. PTAL.

Member

@yuzefovich yuzefovich left a comment


Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @ZhouXing19)


-- commits line 14 at r1:
Can these statements be PREPAREd? I'd really like to get rid of IsAllowedToPause if possible in order to check / revoke pausability in fewer places.


pkg/sql/conn_executor_exec.go line 2009 at r1 (raw file):

		// un-pausable (normal) portal.
		// With pauseInfo is nil, no cleanup function will be added to the stack
		// and all clean-up steps will be performed as for normal portals.

I think planner.curPlan.close will not be called this way because we already entered the branch for the pausable portal case. What do you think about lifting this "revoke" block outside of makeExecPlan, one level up the stack? Then it'd be more clear what's going on, and it'll be easier to revoke pausability correctly.


pkg/sql/conn_executor_exec.go line 2010 at r1 (raw file):

		// With pauseInfo is nil, no cleanup function will be added to the stack
		// and all clean-up steps will be performed as for normal portals.
		planner.pausablePortal.pauseInfo = nil

nit: should we also increment a telemetry counter (perhaps introduce a new one)?

@rharding6373 rharding6373 force-pushed the 20230808_portal_mutation_exec_107130 branch from aac62fc to bd5c0a6 Compare August 21, 2023 19:54
@rharding6373 rharding6373 requested a review from a team as a code owner August 21, 2023 19:54
@rharding6373 rharding6373 force-pushed the 20230808_portal_mutation_exec_107130 branch from bd5c0a6 to fde1a75 Compare August 21, 2023 19:56
Collaborator Author

@rharding6373 rharding6373 left a comment


RFAL, thanks for the reviews so far!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich and @ZhouXing19)


-- commits line 14 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Can these statements be PREPAREd? I'd really like to get rid of IsAllowedToPause if possible in order to check / revoke pausability in fewer places.

According to the docs (https://www.postgresql.org/docs/current/protocol-flow.html), any statement that doesn't return rows (like TRUNCATE) executes to completion, so we don't necessarily have to prevent such statements from executing in a pausable portal, though @ZhouXing19 had some concerns about this.


pkg/sql/conn_executor_exec.go line 2009 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I think planner.curPlan.close will not be called this way because we already entered the branch for the pausable portal case. What do you think about lifting this "revoke" block outside of makeExecPlan, one level up the stack? Then it'd be more clear what's going on, and it'll be easier to revoke pausability correctly.

Good catch. Hmm... if we move this block before the call to makeOptimizerPlan we can't use the flag, since the flag is set during that call. I did a small refactor to make sure we close correctly, but it isn't pretty.


pkg/sql/conn_executor_exec.go line 2010 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: should we also increment a telemetry counter (perhaps introduce a new one)?

We increment sqltelemetry.NotReadOnlyStmtsTriedWithPausablePortals in this block. That seems like the appropriate counter for statements containing mutations.


pkg/sql/sem/tree/stmt.go line 152 at r2 (raw file):

// Now we only allow SELECT, but is it too strict? And how to filter out
// SELECT with data writes / schema changes?
func IsAllowedToPause(stmt Statement) bool {

@ZhouXing19 by removing IsAllowedToPause we're revoking pausable portals for the operators covered by IsMutationOp and IsDDLOp in pkg/sql/opt/operator.og.go instead of the original set derived from CanModifySchema and CanWriteData below. I think this should be OK, because the operators that are no longer included don't return any rows and can therefore be executed in a pausable portal (the portal simply won't pause). Could you please sanity-check?

@ZhouXing19
Collaborator

ZhouXing19 commented Aug 21, 2023

It should be OK to allow statements that don't return rows in pausable portals: with row == nil in DistSQLReceiver.Push(), limitedCommandResult.AddRow() will never be called.
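That behavior can be sketched as follows; `rowResult` and `push` are illustrative stand-ins for limitedCommandResult and the relevant branch of DistSQLReceiver.Push, not the real types:

```go
package main

import "fmt"

// rowResult stands in for limitedCommandResult: AddRow is the only call
// that can trigger a portal pause, so a statement that never produces a
// row can never pause. Names here are illustrative, not the real API.
type rowResult struct{ rowsAdded int }

func (r *rowResult) AddRow(row []string) { r.rowsAdded++ }

// push mimics the relevant branch of DistSQLReceiver.Push: pushes that
// carry only metadata have row == nil and never reach AddRow.
func push(res *rowResult, row []string) {
	if row == nil {
		return // metadata or completion signal, nothing to add
	}
	res.AddRow(row)
}

func main() {
	res := &rowResult{}
	push(res, nil)             // e.g. an ALTER TABLE produces no rows
	push(res, []string{"x"})   // a row-returning statement
	fmt.Println(res.rowsAdded) // 1
}
```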

The concern is that in PG, for statements of this kind, the portal is destroyed after the statement executes and re-execution is not allowed. (You can always run the pgtests against PG with ./dev test pkg/sql/pgwire --filter='TestPGTest/jane_test' --rewrite --test-args='-addr=localhost:5432'):

send
Query {"String": "BEGIN"}
Query {"String": "CREATE TABLE mytable (x int)"}
Parse {"Name": "q1", "Query": "ALTER TABLE mytable ADD COLUMN y int"}
Bind {"DestinationPortal": "p1", "PreparedStatement": "q1"}
Execute {"Portal": "p1", "MaxRows": 1}
Execute {"Portal": "p1", "MaxRows": 1}
Sync
----


until keepErrMessage
ReadyForQuery
ReadyForQuery
ErrorResponse
ReadyForQuery
----
{"Type":"CommandComplete","CommandTag":"BEGIN"}
{"Type":"ReadyForQuery","TxStatus":"T"}
{"Type":"CommandComplete","CommandTag":"CREATE TABLE"}
{"Type":"ReadyForQuery","TxStatus":"T"}
{"Type":"ParseComplete"}
{"Type":"BindComplete"}
{"Type":"CommandComplete","CommandTag":"ALTER TABLE"}
{"Type":"ErrorResponse","Code":"55000","Message":"portal \"p1\" cannot be run"}
{"Type":"ReadyForQuery","TxStatus":"E"}

However, our current implementation will keep the portal and the underlying flow (I'm also concerned about whether there's any hazard in keeping the flow unreleased for a long time; maybe @yuzefovich knows more about this?):

send
Query {"String": "BEGIN"}
Query {"String": "CREATE TABLE mytable (x int)"}
Parse {"Name": "q1", "Query": "ALTER TABLE mytable ADD COLUMN y int"}
Bind {"DestinationPortal": "p1", "PreparedStatement": "q1"}
Execute {"Portal": "p1", "MaxRows": 1}
Execute {"Portal": "p1", "MaxRows": 1}
Sync
----


until keepErrMessage
ReadyForQuery
ReadyForQuery
ReadyForQuery
----
{"Type":"CommandComplete","CommandTag":"BEGIN"}
{"Type":"ReadyForQuery","TxStatus":"T"}
{"Type":"CommandComplete","CommandTag":"CREATE TABLE"}
{"Type":"ReadyForQuery","TxStatus":"T"}
{"Type":"ParseComplete"}
{"Type":"BindComplete"}
{"Type":"CommandComplete","CommandTag":"ALTER TABLE"}
{"Type":"CommandComplete","CommandTag":"ALTER TABLE"}
{"Type":"ReadyForQuery","TxStatus":"E"}

I think the easiest way to accommodate this group of statements would be to modify CreateStatementResult() to ignore the limit clause and create a normal commandResult for them (i.e., implicitly make them un-pausable). Another option would be to mark the receiver as ready-to-be-closed in execinfra.Run(), so that the conn executor cleans up the portal at the end of execution.
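The first suggestion could look roughly like this. It is a hedged sketch: `effectiveRowLimit` and its parameters are hypothetical names illustrating the idea, not actual CockroachDB functions:

```go
package main

import "fmt"

// effectiveRowLimit sketches the suggested change to CreateStatementResult:
// if we can tell at PREPARE time that a statement returns no rows, ignore
// the client's Execute row limit so a normal (un-pausable) result is
// created. Both the function and its parameters are hypothetical.
func effectiveRowLimit(requestedLimit int, stmtReturnsRows bool) int {
	if !stmtReturnsRows {
		return 0 // 0 means "no limit" here: execute to completion
	}
	return requestedLimit
}

func main() {
	fmt.Println(effectiveRowLimit(1, false)) // ALTER TABLE-like statement: 0
	fmt.Println(effectiveRowLimit(1, true))  // SELECT-like statement: 1
}
```

With the limit ignored, a statement like the ALTER TABLE above would complete on the first Execute and the portal would never be treated as pausable.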

Member

@yuzefovich yuzefovich left a comment


I'm also concerned about whether there's any hazard in keeping the flow unreleased for a long time; maybe @yuzefovich knows more about this?

Indeed, that can be a concern in the sense that we could be leaking (more precisely, not releasing) some resources (like RAM usage) until the flow is cleaned up.

I think the easiest way to accommodate this group of statements could be to modify the CreateStatementResult() to ignore the limit clause and create a normal commandResult for them.

I like this idea - if we can determine at PREPARE time that the stmt is such that it won't produce any rows, we should not use the pausable portal execution model for it.

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @ZhouXing19)


pkg/sql/conn_executor_exec.go line 2009 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Good catch. Hmm... if we move this block before the call to makeOptimizerPlan we can't use the flag, since the flag is set during that call. I did a small refactor to make sure we close correctly, but it isn't pretty.

I didn't mean moving this code block to before makeOptimizerPlan; I was thinking it would go right after makeExecPlan is called in the pausable portal branch, something like this (I'll post it separately because Reviewable messes up the formatting).


pkg/sql/conn_executor_exec.go line 2010 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

We increment sqltelemetry.NotReadOnlyStmtsTriedWithPausablePortals in this block. That seems like the appropriate counter for statements containing mutations.

Indeed, I missed it, nvm.


pkg/sql/conn_executor_exec.go line 1761 at r2 (raw file):

			// We don't allow sub / post queries for pausable portal. Set it back to an
			// un-pausable (normal) portal.
			// With pauseInfo is nil, no cleanup function will be added to the stack

I think we might have an existing bug here. Namely, we added planner.curPlan.close as the cleanup step, but now we're overriding pauseInfo to nil, so that deferred cleanup step won't be executed. Perhaps leave a TODO to investigate it.
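The suspected bug can be reproduced in miniature. The toy Go sketch below uses hypothetical `pauseInfo` and `portal` types, not the real conn_executor structures, to show how resetting pauseInfo to nil orphans any already-queued cleanup function:

```go
package main

import "fmt"

// pauseInfo and portal are toy stand-ins for the conn_executor structures
// under discussion; they are not the real CockroachDB types.
type pauseInfo struct {
	cleanup []func()
}

type portal struct {
	info *pauseInfo
}

// close runs the deferred cleanup functions, but only if pauseInfo is
// still set. Resetting info to nil after queuing a cleanup function
// silently orphans it, which is the suspected bug.
func (p *portal) close() {
	if p.info == nil {
		return
	}
	for _, f := range p.info.cleanup {
		f()
	}
}

func main() {
	ran := false
	p := &portal{info: &pauseInfo{}}
	p.info.cleanup = append(p.info.cleanup, func() { ran = true }) // like planner.curPlan.close
	p.info = nil // pausability revoked after the cleanup was queued
	p.close()
	fmt.Println(ran) // false: the queued cleanup never ran
}
```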


pkg/sql/conn_executor_exec.go line 2173 at r2 (raw file):

		// paused either.
		if err := res.RevokePortalPausability(); err != nil {
			res.SetError(err)

I think we should just return the error as usual. In dispatchToExecutionEngine we don't return the error and only write it to res because returned errors result in connection shutdown; makeExecPlan doesn't have such a contract.

@yuzefovich
Member

I was thinking something along these lines:

diff --git a/pkg/sql/conn_executor_exec.go b/pkg/sql/conn_executor_exec.go
index 80d598a1d9c..fde07a7604c 100644
--- a/pkg/sql/conn_executor_exec.go
+++ b/pkg/sql/conn_executor_exec.go
@@ -1632,11 +1632,20 @@ func (ex *connExecutor) dispatchToExecutionEngine(
        if ppInfo := getPausablePortalInfo(); ppInfo != nil {
                if !ppInfo.dispatchToExecutionEngine.cleanup.isComplete {
                        err = ex.makeExecPlan(ctx, planner)
-                       ppInfo.dispatchToExecutionEngine.planTop = planner.curPlan
-                       ppInfo.dispatchToExecutionEngine.cleanup.appendFunc(namedFunc{
-                               fName: "close planTop",
-                               f:     func() { ppInfo.dispatchToExecutionEngine.planTop.close(ctx) },
-                       })
+                       if flags := planner.curPlan.flags; err == nil && (flags.IsSet(planFlagContainsMutation) || flags.IsSet(planFlagIsDDL)) {
+                               telemetry.Inc(sqltelemetry.NotReadOnlyStmtsTriedWithPausablePortals)
+                               // We don't allow mutations in a pausable portal. Set it back to
+                               // an un-pausable (normal) portal.
+                               planner.pausablePortal.pauseInfo = nil
+                               err = res.RevokePortalPausability()
+                               defer planner.curPlan.close(ctx)
+                       } else {
+                               ppInfo.dispatchToExecutionEngine.planTop = planner.curPlan
+                               ppInfo.dispatchToExecutionEngine.cleanup.appendFunc(namedFunc{
+                                       fName: "close planTop",
+                                       f:     func() { ppInfo.dispatchToExecutionEngine.planTop.close(ctx) },
+                               })
+                       }
                } else {
                        planner.curPlan = ppInfo.dispatchToExecutionEngine.planTop
                }

Collaborator

@rafiss rafiss left a comment


thanks for the improvement! should this be considered a "bug fix" in release notes? (if only master is affected, then no need)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @ZhouXing19)

@rharding6373 rharding6373 force-pushed the 20230808_portal_mutation_exec_107130 branch from fde1a75 to 814b494 Compare August 22, 2023 19:31
Collaborator Author

@rharding6373 rharding6373 left a comment


This only fixes master (my original motivation was disabling pausable portals for UDFs with mutations, which will be supported in 23.2) and does a lot of cleanup/refactoring.

I implemented Jane's suggestion to override the limit when creating statement results, PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich and @ZhouXing19)


pkg/sql/conn_executor_exec.go line 2009 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I didn't mean moving this code block to before makeOptimizerPlan; I was thinking it would go right after makeExecPlan is called in the pausable portal branch, something like this (I'll post it separately because Reviewable messes up the formatting).

Thanks for the suggestion, that's much cleaner.


pkg/sql/conn_executor_exec.go line 1761 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I think we might have an existing bug here. Namely, we added planner.curPlan.close as the cleanup step, but now we're overriding pauseInfo to nil, so that deferred cleanup step won't be executed. Perhaps leave a TODO to investigate it.

Done.


pkg/sql/conn_executor_exec.go line 2173 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I think we should just return the error as usual. In dispatchToExecutionEngine we don't return the error and only write it to res because returned errors result in connection shutdown; makeExecPlan doesn't have such a contract.

Fixed and moved.

Member

@yuzefovich yuzefovich left a comment


Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @ZhouXing19)


pkg/sql/conn_executor_exec.go line 1641 at r3 (raw file):

				// an un-pausable (normal) portal.
				planner.pausablePortal.pauseInfo = nil
				if err := res.RevokePortalPausability(); err != nil {

We now have the revoke call twice. Let's remove this one (the one with the return inside it): we don't want to return right away. To match how the error is processed in the non-pausable portal branch, we let the code continue and do the error handling some lines below.


pkg/sql/conn_executor_exec.go line 2154 at r3 (raw file):

// should check if it still exists when this function returns.
func (ex *connExecutor) makeExecPlan(
	ctx context.Context, planner *planner, res RestrictedCommandResult,

nit: changes to the comment and the signature are no longer needed.

@rharding6373 rharding6373 force-pushed the 20230808_portal_mutation_exec_107130 branch from 814b494 to 0c3d1d4 Compare August 22, 2023 23:02
Collaborator Author

@rharding6373 rharding6373 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich and @ZhouXing19)


pkg/sql/conn_executor_exec.go line 1641 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

We now have the revoke call twice. Let's remove this one (the one with the return inside it): we don't want to return right away. To match how the error is processed in the non-pausable portal branch, we let the code continue and do the error handling some lines below.

Copy paste error. Thanks for the catch.


pkg/sql/conn_executor_exec.go line 2154 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: changes to the comment and the signature are no longer needed.

Done.

Member

@yuzefovich yuzefovich left a comment


Thanks for working through this! :lgtm:

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ZhouXing19)


pkg/sql/conn_executor_exec.go line 2154 at r3 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Done.

nit: the callsites need to be reverted accordingly too.

@rharding6373 rharding6373 force-pushed the 20230808_portal_mutation_exec_107130 branch from 0c3d1d4 to 652f2a3 Compare August 23, 2023 15:58
@ZhouXing19
Collaborator

I'm also concerned about whether there's any hazard in keeping the flow unreleased for a long time; maybe @yuzefovich knows more about this?

Indeed, that can be a concern in the sense that we could be leaking (more precisely, not releasing) some resources (like RAM usage) until the flow is cleaned up.

Not for this PR, but should we therefore keep a cache for pausable portals, and clean up portals when the pausable portal_cache_size is reached, similar to prepStmtsLRU?

@yuzefovich
Member

Indeed, that can be a concern in the sense that we could be leaking (more precisely, not releasing) some resources (like RAM usage) until the flow is cleaned up.

Not for this PR, but should we therefore keep a cache for pausable portals, and clean up portals when the pausable portal_cache_size is reached, similar to prepStmtsLRU?

Flow objects are not reusable (i.e., we cannot reset a flow after one execution and use it again for the next). This is a pretty fundamental assumption that I doubt we'll want to change for the pausable portals execution model.

@ZhouXing19
Collaborator

ZhouXing19 commented Aug 23, 2023

Flow objects are not reusable (i.e., we cannot reset a flow after one execution and use it again for the next). This is a pretty fundamental assumption that I doubt we'll want to change for the pausable portals execution model.

This makes sense, but I was thinking more about automatically cleaning up the whole portal and its execution dependencies (in contrast to cleaning up just the underlying flow while keeping the portal object) to prevent the resource leak you mentioned. But this might not be urgent at the moment.

Previously we examined the AST to determine whether a statement could be
executed in a pausable portal or not. However, this was insufficient to
identify volatile UDFs that could also contain mutations. We also
revoked pausable portals for operators for which the portal type didn't
matter, since they would be executed to completion regardless.

This PR revokes a portal's pausability if the statement's plan contains
a mutation or DDL. That is, if the opt builder determines that any
operator is a mutation operator (see `IsMutationOp` and `IsDDLOp`).

Epic: None
Fixes: cockroachdb#107130

Release note: None
@rharding6373 rharding6373 force-pushed the 20230808_portal_mutation_exec_107130 branch from 652f2a3 to 1a193fc Compare August 23, 2023 20:07
@rharding6373
Collaborator Author

TFTRs!

automatically cleaning up the whole portal and its execution dependencies (in contrast to cleaning up just the underlying flow while keeping the portal object) to prevent the resource leak you mentioned.

@ZhouXing19 Could you open an issue for this to describe the improvements you envision? I don't think adding a TODO in the code here would do it justice.

@rharding6373
Collaborator Author

bors r+

@craig
Contributor

craig bot commented Aug 24, 2023

Build succeeded:

@craig craig bot merged commit 7437f5e into cockroachdb:master Aug 24, 2023
@rharding6373 rharding6373 deleted the 20230808_portal_mutation_exec_107130 branch August 24, 2023 20:48
Successfully merging this pull request may close these issues.

sql: prevent pausable portals for queries that call volatile UDFs