
sql: update connExecutor logic for pausable portals #99663

Merged
merged 8 commits into cockroachdb:master on Apr 7, 2023

Conversation

Collaborator

@ZhouXing19 ZhouXing19 commented Mar 27, 2023

This PR replaces #96358 and is part of the initial implementation of multiple active portals.


This PR adds limited support for multiple active portals. Portals satisfying all of the following restrictions can now be paused and resumed (i.e., with other queries interleaved between their executions):

  1. Not an internal query;
  2. Read-only query;
  3. No sub-queries or post-queries.

Such a portal will only have its statement executed with a non-distributed plan.

This feature is gated by the session variable multiple_active_portals_enabled. When it is set to true, all portals that satisfy the restrictions above automatically become "pausable" when created via the pgwire Bind message.

The core idea of this implementation is:

  1. Add a switchToAnotherPortal status to the result-consumption state machine. When we receive an ExecPortal message for a different portal, we simply return control to the connExecutor. (sql: add switchToAnotherPortal signal for result consumer #99052)
  2. Persist the flow, queryID, span, and instrumentationHelper for the portal, and reuse them when we re-execute the portal. This ensures we continue fetching where we left off rather than starting all over. (sql: enable resumption of a flow for pausable portals #99173)
  3. To enable 2, we need to delay the clean-up of resources until we close the portal. For this, we introduce stacks of cleanup functions; see the sketch just after this list. (This PR)
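To make the cleanup-stack idea in item 3 concrete, here is a minimal, self-contained Go sketch; the type and step names are illustrative stand-ins, not the actual connExecutor code. Each execution layer pushes a named cleanup step onto a stack owned by the portal, and the whole stack runs in LIFO order only when the portal is finally closed.

``` go
package main

import "fmt"

type namedCleanup struct {
	name string
	f    func()
}

// cleanupStack is a hypothetical stand-in for the per-portal cleanup stacks.
type cleanupStack []namedCleanup

func (s *cleanupStack) push(name string, f func()) {
	*s = append(*s, namedCleanup{name, f})
}

// run executes the recorded cleanup steps in reverse (LIFO) order.
func (s *cleanupStack) run() {
	for i := len(*s) - 1; i >= 0; i-- {
		fmt.Println("cleanup:", (*s)[i].name)
		(*s)[i].f()
	}
	*s = nil
}

func main() {
	var stack cleanupStack
	// Pushed in call order; re-executions of the portal skip these pushes and
	// reuse the persisted resources.
	stack.push("exhaustPortal", func() {})
	stack.push("execStmtInOpenState", func() {})
	stack.push("dispatchToExecutionEngine", func() {})
	stack.push("resumableFlow", func() {})
	// Only closing the portal (or an error / txn end) triggers the stack:
	stack.run() // resumableFlow, dispatchToExecutionEngine, execStmtInOpenState, exhaustPortal
}
```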

Note that we keep the implementation of the original "un-pausable" portal, as we'd like to limit this new functionality to a small set of statements. Eventually, some of the old code paths (e.g. the limitedCommandResult's lifecycle) should be replaced with the new code.

Also, we don't support distributed plans yet, as they involve much more complicated changes. See the Start with an entirely local plan section in the design doc. Support for this will come as a follow-up.

Epic: CRDB-17622

Release note (sql change): Initial support for multiple active portals. With the session variable multiple_active_portals_enabled set to true, portals satisfying all of the following restrictions can be executed in an interleaving manner: 1. Not an internal query; 2. Read-only query; 3. No sub-queries or post-queries. Such a portal will only have its statement executed with an entirely local plan.

@ZhouXing19 ZhouXing19 requested review from rafiss, yuzefovich and a team March 27, 2023 15:12
@ZhouXing19 ZhouXing19 requested review from a team as code owners March 27, 2023 15:12
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment

Nice! I think I have the last batch of comments and questions.

Reviewed 8 of 8 files at r1, 5 of 5 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @ZhouXing19)


pkg/sql/conn_executor.go line 2400 at r1 (raw file):

		// portal. However, we update the result for each execution. Thus, we need
		// to accumulate the number of affected rows before closing the result.
		switch tcmd := cmd.(type) {

nit: since we only have a single case, an if would probably be better than a switch.
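For illustration only (hypothetical command types, not the actual connExecutor definitions), the single-case type switch and the equivalent if with a type assertion look like this:

``` go
package main

import "fmt"

type command interface{}

type execPortal struct{ rowsAffected int }

func accumulate(cmd command, total *int) {
	// switch tcmd := cmd.(type) {
	// case *execPortal:
	// 	*total += tcmd.rowsAffected
	// }
	// ...is equivalent to the single type assertion:
	if tcmd, ok := cmd.(*execPortal); ok {
		*total += tcmd.rowsAffected
	}
}

func main() {
	var total int
	accumulate(&execPortal{rowsAffected: 3}, &total)
	fmt.Println(total) // 3
}
```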


pkg/sql/conn_executor_exec.go line 208 at r1 (raw file):

	pinfo *tree.PlaceholderInfo,
	canAutoCommit bool,
) (ev fsm.Event, payload fsm.EventPayload, err error) {

nit: since we're inspecting this error in the defer, it'd be better to do s/err/retErr/g.


pkg/sql/conn_executor_exec.go line 443 at r1 (raw file):

			// in that jungle, we just overwrite them all here with an error that's
			// nicer to look at for the client.
			// Explaining why we need to check if the result has been released:

How do portals differ from regular queries? I.e. is it something about limitedCommandResult that causes res to be auto-released in some cases?


pkg/sql/conn_executor_exec.go line 448 at r1 (raw file):

			// connExecutor. We should allow this case, so don't want to return an
			// assertion error in resToPushErr.Err() when the result is released.
			if res != nil && cancelQueryCtx.Err() != nil && !res.IsReleased() && res.Err() != nil {

Would it be cleaner to change limitedCommandResult a bit to avoid these !IsReleased() checks? In particular, I'm thinking that we'd need to add implementations of the Err and SetError methods to limitedCommandResult - Err would be valid to call even after Close has been called (even when commandResult.released is true), and SetError would store the error in limitedCommandResult directly (in addition to also storing it in the wrapped commandResult).
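A self-contained sketch of this suggestion, with stand-in types rather than the real pgwire structs: the embedded commandResult.Err asserts the result hasn't been released, while a limitedCommandResult override reads the stored error directly so it stays safe to call after Close.

``` go
package main

import "fmt"

type commandResult struct {
	released bool
	err      error
}

// Err on the embedded type still asserts the result hasn't been released.
func (r *commandResult) Err() error {
	if r.released {
		panic("commandResult used after being released")
	}
	return r.err
}

// limitedCommandResult wraps commandResult; its Err skips the released check.
type limitedCommandResult struct {
	*commandResult
}

func (r *limitedCommandResult) Err() error {
	return r.commandResult.err // read the stored error directly
}

func main() {
	res := &limitedCommandResult{&commandResult{err: fmt.Errorf("boom")}}
	res.released = true
	fmt.Println(res.Err()) // safe even after release: boom
}
```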


pkg/sql/conn_executor_exec.go line 630 at r1 (raw file):

				// the correct instrumentation helper for the paused portal.
				ihToFinish := ih
				if isPausablePortal && portal.pauseInfo.ihWrapper != nil {

nit: the second part of the conditional here is redundant, right? I.e. ihWrapper must always be non-nil here.


pkg/sql/conn_executor_exec.go line 871 at r1 (raw file):

		p.pausablePortal = portal
	} else {
		p.cancelChecker.Reset(ctx)

nit: I'd just remove this call altogether (this resetting is done in ex.resetPlanner above).


pkg/sql/conn_executor_exec.go line 1519 at r2 (raw file):

	// We only allow non-distributed plan for pausable portals.
	if planner.pausablePortal != nil {
		distSQLMode = sessiondatapb.DistSQLOff

We can perform the check whether curPlan has any subqueries or postqueries here, before disabling DistSQL. This will make it so that if the stmt is not supported for pausable portals, we still execute the stmt with normal DistSQL mode.
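A rough, self-contained sketch of the suggested ordering; the names are illustrative, not the real planner fields. The point is to check whether the plan qualifies before forcing DistSQL off, and otherwise keep the session's normal mode.

``` go
package main

import "fmt"

type plan struct {
	hasSubqueries  bool
	hasPostqueries bool
}

// distSQLModeFor decides the DistSQL mode: pausable portals are only allowed
// for fully local plans; anything else keeps the session's setting.
func distSQLModeFor(pausablePortal bool, p plan, sessionMode string) string {
	if pausablePortal && !p.hasSubqueries && !p.hasPostqueries {
		return "off" // execute the pausable portal with a local plan
	}
	return sessionMode // not supported for pausing: normal DistSQL mode
}

func main() {
	fmt.Println(distSQLModeFor(true, plan{}, "auto"))                    // off
	fmt.Println(distSQLModeFor(true, plan{hasSubqueries: true}, "auto")) // auto
	fmt.Println(distSQLModeFor(false, plan{}, "auto"))                   // auto
}
```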


pkg/sql/distsql_running.go line 1643 at r1 (raw file):

	if p := planCtx.getPortalPauseInfo(); p != nil {
		if buildutil.CrdbTestBuild && planCtx.getPortalPauseInfo().flow == nil {

nit: can use p here.


pkg/sql/pgwire/testdata/pgtest/multiple_active_portals line 0 at r1 (raw file):
nit: seems like an empty file in the first commit.


pkg/sql/pgwire/testdata/pgtest/multiple_active_portals line 396 at r3 (raw file):


send
Query {"String": "COMMIT"}

nit: let's read a couple more rows from p1 here before committing.


pkg/sql/pgwire/testdata/pgtest/multiple_active_portals line 649 at r3 (raw file):


send
Query {"String": "SET statement_timeout='100s'"}

nit: why 100s? Do you want to disable the timeout here?


pkg/sql/pgwire/testdata/pgtest/multiple_active_portals line 690 at r3 (raw file):

{"Type":"ReadyForQuery","TxStatus":"T"}

send crdb_only

nit: some comments for what we're trying to check here would be good.


pkg/sql/pgwire/testdata/pgtest/multiple_active_portals line 761 at r3 (raw file):

Bind {"DestinationPortal": "p1", "PreparedStatement": "q1"}
Execute {"Portal": "p1", "MaxRows": 1}
Query {"String": "SET CLUSTER SETTING sql.pgwire.multiple_active_portals.enabled = false"}

nit: do you just want to poison the txn with a non-retriable error here? Perhaps crdb_internal.force_error would be cleaner?

@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch from 13871ad to 43d4f3f on March 28, 2023 00:12
Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

Thanks for reviewing!
In our pair programming last week, you mentioned that it's preferred to have a session variable to gate multiple active portals. Just wanted to confirm if we want to replace the cluster setting with the session var, or have them both.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @yuzefovich)


pkg/sql/conn_executor_exec.go line 443 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

How do portals differ from the regular queries? I.e. is it something about limitedCommandResult that makes res to be auto-released in some cases?

Not really -- we close the result at the end of each execution at connExecutor.execCmd():

res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState()))


pkg/sql/conn_executor_exec.go line 448 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Would it be cleaner to change limitedCommandResult a bit to avoid these !IsReleased() checks? In particular, I'm thinking that we'd need add implementations of Err and SetError methods to limitedCommandResult - Err would be valid to be called even after Close has been called (even when commandResult.released is true) and SetError would store the error in limitedCommandResult directly (in addition to also storing it in the wrapped commandResult).

Good idea, but I'm not sure why we can't use the error in the wrapped commandResult directly?


pkg/sql/conn_executor_exec.go line 1519 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

We can perform the check whether curPlan has any subqueries or postqueries here, before disabling DistSQL. This will make it so that if the stmt is not supported for pausable portals, we still execute the stmt with normal DistSQL mode.

Done.


pkg/sql/pgwire/testdata/pgtest/multiple_active_portals line 649 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: why 100s? Do you want to disable the timeout here?

Yeah 😅 changed it to RESET

Member

@yuzefovich yuzefovich left a comment

In our pair programming last week, you mentioned that it's preferred to have a session variable to gate multiple active portals. Just wanted to confirm if we want to replace the cluster setting with the session var, or have them both.

I think both would be good, but also it seems like we're discouraging introduction of new cluster settings when session variables would suffice, so maybe just the session var is the right approach. I'll defer to Rafi on this one.

Reviewed 8 of 8 files at r4, 6 of 6 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @ZhouXing19)


pkg/sql/conn_executor_exec.go line 443 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Not really -- we close the result at the end of each execution at connExecutor.execCmd():

res.Close(ctx, stateToTxnStatusIndicator(ex.machine.CurState()))

Ok, I see what's going on now. We can hit the panic that the result has been released when Sync is reached in an implicit transaction - at that point we have called res.Close (because the execution of the previous pgwire command was done) and we haven't created a new result by the time we call

ev, payload = ex.handleAutoCommit(ctx, &tree.CommitTransaction{})

We're violating the CommandResultClose.Close contract, which says that no new calls on the result are allowed after Close, but I think it's ok to make an exception here. It'd be good to explain this in the comment on limitedCommandResult.Err.


pkg/sql/conn_executor_exec.go line 448 at r1 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Good idea, but I'm not sure why we can't use the error in the wrapped commandResult directly?

Yeah, you're right, we can just use commandResult.err directly. We also then don't need to add limitedCommandResult.SetError method since we're only calling Err after Close.


pkg/sql/conn_executor_exec.go line 1621 at r6 (raw file):

			fName: "populate query level stats and regions",
			f: func() {
				populateQueryLevelStatsAndRegions(ctx, &curPlanner, ex.server.cfg, &stats, &ex.cpuStatsCollector)

Here and below stats is captured incorrectly - we need to use ppInfo.queryStats.


pkg/sql/conn_executor_exec.go line 1624 at r6 (raw file):

				stmtFingerprintID = ex.recordStatementSummary(
					ctx, &curPlanner,
					int(ex.state.mu.autoRetryCounter), ppInfo.rowsAffected, res.Err(), stats,

This res is the result that was created on the first execution which is incorrect. We should store the "current" result in pauseInfo, update it on each re-execution, and use the latest result here.


pkg/sql/pgwire/command_result.go line 433 at r6 (raw file):

// Postgres protocol.
//
// This design is known to be flawed. It only supports a specific subset of the

nit: this comment is now out-of-date.


pkg/sql/pgwire/command_result.go line 671 at r6 (raw file):

}

// Err is part of the sql.RestrictedCommandResult interface.

nit: let's mention in the comment why we need this method. In particular, we need to allow Err calls after Close has been called in implicit txns.

@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch 2 times, most recently from 650ad1b to 3589eee on March 28, 2023 05:58
Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @yuzefovich)


-- commits line 48 at r15:
I will make a separate PR to add a test for this. I'm thinking about having a field in ExecutorTestingKnobs that forces an error when resuming a portal, and ensuring that we don't increment the number of executed stmts.
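A self-contained sketch of that testing-knob idea; the knob name and executor fields below are hypothetical. The knob is a hook invoked when a portal is resumed, letting a test inject an error and then assert that the executed-statement counter did not move.

``` go
package main

import (
	"errors"
	"fmt"
)

type executorTestingKnobs struct {
	// ForceErrOnPausablePortalResume, if set, is called when a pausable
	// portal is resumed; a non-nil error aborts the resumption.
	ForceErrOnPausablePortalResume func() error
}

type executor struct {
	knobs           executorTestingKnobs
	executedStmtCnt int
}

func (ex *executor) resumePortal() error {
	if fn := ex.knobs.ForceErrOnPausablePortalResume; fn != nil {
		if err := fn(); err != nil {
			return err // a failed resumption must not count as an executed stmt
		}
	}
	ex.executedStmtCnt++
	return nil
}

func main() {
	ex := executor{knobs: executorTestingKnobs{
		ForceErrOnPausablePortalResume: func() error { return errors.New("injected") },
	}}
	fmt.Println(ex.resumePortal(), ex.executedStmtCnt) // injected 0
}
```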


pkg/sql/conn_executor_exec.go line 401 at r10 (raw file):

				"increment executed stmt cnt",
				func() {
					if retErr == nil && !payloadHasError(retPayload) {

Addressed all comments.
@yuzefovich Inspired by your comment on retaining the "current" res in pauseInfo, I think this was wrong, so I made another commit, sql: store retErr and retPayload for pausable portals. The retErr here should be the one evaluated when this cleanup is executed, rather than when it is created.

Member

@yuzefovich yuzefovich left a comment

Reviewed 9 of 9 files at r11, 6 of 6 files at r12, 2 of 2 files at r13, 5 of 5 files at r14, 1 of 1 files at r15, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @ZhouXing19)


pkg/sql/conn_executor_exec.go line 401 at r10 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

Addressed all comments.
@yuzefovich Inspired by your comment on retaining the "current" res in pauseInfo, I think this was wrong, so made another commit sql: store retErr and retPayload for pausable portals. The retErr here should be the one evaluated when this cleanup is executed, rather than when this it is created.

Good catch, we need to be careful with all of these deferred cleanup functions.


pkg/sql/conn_executor_exec.go line 437 at r12 (raw file):

				cancelQueryFunc = portal.pauseInfo.cancelQueryFunc
			}
			var resToPushErr RestrictedCommandResult

nit: you could combine these two lines into one resToPushErr := res.


pkg/sql/conn_executor_exec.go line 487 at r13 (raw file):

				retPayload = eventNonRetriableErrPayload{err: errToPush}
				if isPausablePortal {
					portal.pauseInfo.retPayload = retPayload

Shouldn't the defer at the top of the method handle this for us? The "increment executed stmt cnt" deferred cleanup function should be executed after we update pauseInfo with the latest retErr and retPayload. I think that defer is the only place where we need to modify pauseInfo for this.


pkg/sql/pgwire/command_result.go line 671 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: let's mention in the comment why we need this method. In particular, we need to allow Err calls after Close has been called in implicit txns.

Let's add a comment on Err for future readers.


pkg/sql/pgwire/command_result.go line 649 at r12 (raw file):

// SetError is part of the sql.RestrictedCommandResult interface.
func (r *limitedCommandResult) SetError(err error) {

nit: we shouldn't need this method, right?

@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch from 3589eee to b05b25f on March 28, 2023 22:10
Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

I chatted with Rafi today, and we agreed that session var is better. Will add a commit for that shortly.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @yuzefovich)


pkg/sql/conn_executor_exec.go line 487 at r13 (raw file):

The "increment executed stmt cnt" deferred cleanup function should be executed after we update pauseInfo with the latest retErr and retPayload.

Right, but this part of the cleanup is also called after we update pauseInfo with the latest retErr and retPayload.

Note that this part is in the closure of processCleanupFunc, so it's part of the cancel query cleanup function. We trigger the cleanup from the first defer in execStmtInOpenState(), which happens after pauseInfo.retPayload is updated.

In summary, the execution order is:

  1. update pauseInfo with the latest retErr and retPayload
  2. cleanup stack is triggered
  3. cancel query cleanup func
  4. increment executed stmt cnt

pkg/sql/pgwire/command_result.go line 671 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Let's add a comment on Err for future readers.

Ah sorry for the confusion, a fixup seems to have been missed -- added now.

Member

@yuzefovich yuzefovich left a comment

Reviewed 5 of 5 files at r18, 1 of 1 files at r19, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @ZhouXing19)


pkg/sql/conn_executor_exec.go line 487 at r13 (raw file):

Previously, ZhouXing19 (Jane Xing) wrote…

The "increment executed stmt cnt" deferred cleanup function should be executed after we update pauseInfo with the latest retErr and retPayload.

Right, but this part of the cleanup is also called after we update pauseInfo with the latest retErr and retPayload.

Note that this part is in the closure of processCleanupFunc, so it's part of the cancel query cleanup function. We trigger the cleanup from the first defer in execStmtInOpenState(), which happens after pauseInfo.retPayload is updated.

In summary, the execution order is:

  1. update pauseInfo with the latest retErr and retPayload
  2. cleanup stack is triggered
  3. cancel query cleanup func
  4. increment executed stmt cnt

Hm, I see, this seems very fragile. It might be worth introducing a couple of helper methods like

updateRetPayload := func(payload fsm.EventPayload) {
  retPayload = payload
  if isPausablePortal {
    portal.pauseInfo.retPayload = payload
  }
}

and the same for retErr to make it explicit that all updates to retPayload and retErr must also update pause info.

As the code is written currently, it seems very easy to introduce another place where retPayload or retErr is updated but we forget to update the pause info.


pkg/sql/pgwire/command_result.go line 657 at r19 (raw file):

// Err is part of the sql.RestrictedCommandResult interface.
// Unlike commandResult.Err(), we don't assert the result is not release here.

nit: s/release/released/g.

Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @yuzefovich)


pkg/sql/conn_executor_exec.go line 487 at r13 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, I see, this seems very fragile. It might be worth introducing a couple of helper methods like

updateRetPayload := func(payload fsm.EventPayload) {
  retPayload = payload
  if isPausablePortal {
    portal.pauseInfo.retPayload = payload
  }
}

and the same for retErr to make it explicit that all updates to retPayload and retErr must also update pause info.

As the code is written currently, it seems very easy to introduce another place where retPayload or retErr is updated but we forget to update the pause info.

Yeah, it is confusing ... Essentially this is for the case where an error happens at a cleanup step, and we need to ensure all cleanup steps after it are able to see this error.
To do this, I have updateRetErrAndPayload(retErr, retPayload) always called at the end of each cleanup step (I will add a test for this in another PR). Does this make sense to you?
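A self-contained sketch of this ordering; the names mirror the PR discussion but the types are stand-ins. The defer that records the latest retErr is registered last, so it runs first, and each cleanup step ends by recording any error it produced so the steps after it can see it.

``` go
package main

import (
	"errors"
	"fmt"
)

type pauseInfo struct{ retErr error }

func execStmtInOpenState(pi *pauseInfo) (retErr error) {
	update := func(err error) { pi.retErr = err } // stand-in for updateRetErrAndPayload

	// Registered first, so it runs last: a simulated cleanup stack.
	defer func() {
		fmt.Println("cancel-query cleanup sees:", pi.retErr)
		if cleanupErr := errors.New("error hit during cleanup"); pi.retErr == nil {
			update(cleanupErr) // the step records its own error at the end...
		}
		fmt.Println("increment-stmt-cnt cleanup sees:", pi.retErr) // ...so the next step sees it
	}()

	// Registered second, so it runs first: record the latest retErr so the
	// cleanup stack above observes what execution actually returned.
	defer func() { update(retErr) }()

	return nil // try `return errors.New("boom")` to see the execution-error path
}

func main() {
	execStmtInOpenState(&pauseInfo{})
}
```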

@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch 3 times, most recently from dee1dba to 9a6b91c on March 29, 2023 22:35
@ZhouXing19
Collaborator Author

The failed CI is due to a slow quiesce error that seems unrelated (couldn't repro it locally).

Collaborator

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich and @ZhouXing19)


pkg/sql/vars.go line 2667 at r60 (raw file):

			return formatBoolAsPostgresSetting(evalCtx.SessionData().MultipleActivePortalsEnabled), nil
		},
		GlobalDefault: globalFalse,

are you still planning to make this default value be a metamorphic constant? (can be in a later PR too)

@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch from 118c710 to 390bdc9 on April 4, 2023 15:42
Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rafiss and @yuzefovich)


pkg/sql/vars.go line 2667 at r60 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

are you still planning to make this default value be a metamorphic constant? (can be in a later PR too)

Updated it to be a metamorphic constant
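For context, here is a self-contained illustration of what a metamorphic default does; this is just the concept, not the CockroachDB helper or its actual API. In test builds the default is randomly flipped so CI exercises both settings, while production builds keep the stable default.

``` go
package main

import (
	"fmt"
	"math/rand"
)

// crdbTestBuild stands in for a real test-build flag (e.g. a build tag).
const crdbTestBuild = true

// metamorphicBool returns prodDefault in production builds, but may flip it
// randomly under test builds so both values get exercised.
func metamorphicBool(name string, prodDefault bool) bool {
	if crdbTestBuild && rand.Intn(2) == 0 {
		fmt.Printf("metamorphic override: %s = %t\n", name, !prodDefault)
		return !prodDefault
	}
	return prodDefault
}

var defaultMultipleActivePortalsEnabled = metamorphicBool(
	"multiple-active-portals-enabled", false,
)

func main() {
	fmt.Println("default:", defaultMultipleActivePortalsEnabled)
}
```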

@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch from 390bdc9 to 6e0e5e2 on April 4, 2023 16:15
Collaborator

@rafiss rafiss left a comment

thanks! very impressive work

i've just left two nits which can be addressed easily

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich and @ZhouXing19)


pkg/sql/distsql_running.go line 1648 at r60 (raw file):

	if p := planCtx.getPortalPauseInfo(); p != nil {
		if buildutil.CrdbTestBuild && p.resumableFlow.flow == nil {

is this CrdbTestBuild check intentional?


pkg/sql/prepared_stmt.go line 189 at r60 (raw file):

	// TODO(sql-session): address the compatibility of pausable portals and
	// prepared_statements_cache_size.

it's fine to leave as a todo - but what is the behavior right now? the cache is ignored? (we can put these details into #99959 since the issue is already linked here)

@ZhouXing19
Collaborator Author

@yuzefovich @rafiss thanks for reviewing!

Following suggestions from Rafi, I made the session var a metamorphic constant, and this revealed quite a few bugs in CI :) To make the review easier, I made a separate PR #100851 for the fixes; after getting approval there I'll close #100851 and fold its changes into this PR. Could you give it another round of review?

Thanks!!

@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch 2 times, most recently from 1dd6c97 to 0fc7a6c on April 7, 2023 02:19
With the introduction of pausable portals, the comment for `limitedCommandResult`
needs to be updated to reflect the current behavior.

Release note: None
This change introduces a new session variable for a preview feature. When set to `true`,
all non-internal portals with read-only [`SELECT`](../v23.1/selection-queries.html)
queries without sub-queries or post-queries can be paused and resumed in an interleaving
manner, but are executed with a local plan.

Release note (SQL change): Added the session variable `multiple_active_portals_enabled`.
This setting is only for a preview feature. When set to `true`, it allows
multiple portals to be open at the same time, with their execution interleaved
with each other. In other words, these portals can be paused. The underlying
statement for a pausable portal must be a read-only `SELECT` query without
sub-queries or post-queries, and such a portal is always executed with a local
plan.
@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch from 0fc7a6c to 90d8ba9 on April 7, 2023 12:12
…e persistence

This commit is part of the implementation of multiple active portals. In order to
execute portals in an interleaving manner, certain resources need to be persisted,
and their clean-up must be delayed until the portal is closed. Additionally, these
resources don't need to be set up again when the portal is re-executed.

To achieve this, we store the cleanup steps in the `cleanup` function stacks in
`portalPauseInfo`, and they are called when any of the following events occur:

1. SQL transaction is committed or rolled back
2. Connection executor is closed
3. An error is encountered when executing the portal
4. The portal is explicitly closed by the user

The cleanup functions should be called in the original order of a normal portal.
Since a portal's execution follows the `execPortal() -> execStmtInOpenState() ->
dispatchToExecutionEngine() -> flow.Run()` function flow, we categorize the cleanup
functions into 4 "layers", which are stored accordingly in `PreparedPortal.pauseInfo`.
The cleanup is always LIFO, following the

- resumableFlow.cleanup
- dispatchToExecutionEngine.cleanup
- execStmtInOpenState.cleanup
- exhaustPortal.cleanup

order. Additionally, if an error occurs in any layer, we clean up the current and
subsequent layers. For example, if an error occurs in `execStmtInOpenState()`, we
perform `resumableFlow.cleanup` and `dispatchToExecutionEngine.cleanup` (subsequent)
and then `execStmtInOpenState.cleanup` (current) before returning the error to
`execPortal()`, where `exhaustPortal.cleanup` will eventually be called. This is to
maintain the previous clean-up process for portals as much as possible.

We also pass the `PreparedPortal` as a reference to the planner in
`execStmtInOpenState()`, so that the portal's flow can be set and reused.

Release note: None
When executing or cleaning up a pausable portal, we may encounter an error and
need to run the corresponding clean-up steps, where we need to check the latest
`retErr` and `retPayload` rather than the ones evaluated when creating the
cleanup functions.

To address this, we use portal.pauseInfo.retErr and .retPayload to record the
latest error and payload. They need to be updated for each execution.

Specifically,

1. If the error happens during portal execution, we ensure `portal.pauseInfo`
records the error by adding the following code to the main body of
`execStmtInOpenState()`:

``` go
defer func() {
	updateRetErrAndPayload(retErr, retPayload)
}()
```

Note this defer must always be registered _after_ the defer that runs the cleanups, so that (being deferred later) it runs first and the cleanups observe the latest values.

2. If the error occurs during a certain cleanup step for the pausable portal,
we ensure that cleanup steps after it can see the error by always having
`updateRetErrAndPayload(retErr, retPayload)` run at the end of each cleanup step.

Release note: None
This commit adds several restrictions to pausable portals to ensure that they
work properly with the current changes to the consumer-receiver model.
Specifically, pausable portals must meet the following criteria:

1. Not be internal queries
2. Be read-only queries
3. Not contain sub-queries or post-queries
4. Only use local plans

These restrictions are necessary because the current changes to the
consumer-receiver model only consider the local push-based case.

Release note: None
When resuming a portal, we always reset the planner. However, we still need the
planner to respect the outer txn's situation, as we did in cockroachdb#98120.

Release note: None
We now only support multiple active portals with local plans, so we explicitly
disable the feature for this test for now.

Release note: None
@ZhouXing19 ZhouXing19 force-pushed the multiple-active-portal-0327 branch from 90d8ba9 to 199b177 on April 7, 2023 13:38
Collaborator Author

@ZhouXing19 ZhouXing19 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rafiss and @yuzefovich)


pkg/sql/distsql_running.go line 1648 at r60 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

is this CrdbTestBuild check intentional?

Yes, it was suggested in #96358 (review)


pkg/sql/prepared_stmt.go line 189 at r60 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

it's fine to leave as a todo - but what is the behavior right now? the cache is ignored? (we can put these details into #99959 since the issue is already linked here)

I just realized that this is not an issue -- in postgres, deallocating a prepared statement doesn't impact the execution of a portal, and we're doing the same already. I added the subtest deallocating_prepared_stmt_should_not_interrupt_portal_execution for this as well.
I removed this comment and will close #99959.

@ZhouXing19
Collaborator Author

I have merged the fixes from #100851.
Thanks so much for the review!!!
bors r+

@craig craig bot merged commit 22ab7ed into cockroachdb:master Apr 7, 2023
@craig
Contributor

craig bot commented Apr 7, 2023

Build succeeded:

@ZhouXing19
Collaborator Author

blathers backport 23.1

@blathers-crl

blathers-crl bot commented Apr 7, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 26da6a7 to blathers/backport-release-23.1-99663: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
