
sql/importer: more controlled shutdown during job cancellation #91615

Conversation

@stevendanna (Collaborator) commented Nov 9, 2022

Previously, we passed the import Resumer's context directly to our
DistSQLReceiver and to (*sql.DistSQLPlanner).Run. This context is
canceled when the user cancels or pauses a job. In practice, this
setup made it very common for dsp.Run to return before the processors
had shut down.

Here, we create a separate context for the distsql flow. When the
Resumer's context is canceled, we SetError on the DistSQLReceiver
which will transition the receiver to DrainRequested which will be
propagated to remote processors. Eventually, this leads to all remote
processors exiting, and the entire flow shutting down.

Note that the propagation of the draining status happens only when a
message is actually pushed from a processor. We push progress messages
from the import processors to the DistSQL receivers every 10 seconds
or so.

To protect against waiting too long, we explicitly cancel the flow
after a timeout.

Further, previously unmanaged goroutines in the import processors are
now explicitly managed in a context group that we wait on during
shutdown, similar to other job-related processors.

Overall, in the included test, this substantially reduces the
frequency at which we see import processors outliving the running job.

Epic: None

Release note: None
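
To make the shape of the change concrete, here is a minimal sketch of the planning-side pattern described above. It is not the actual diff: runImportFlow, runFlow, drainTimeout, and the errSetter interface are illustrative stand-ins; only SetError corresponds to the real DistSQLReceiver method.

// Sketch only: hypothetical helper illustrating a separate flow context
// plus a cancel watcher, as described in the PR description.
package sketch

import (
    "context"
    "time"
)

// errSetter stands in for the DistSQLReceiver; SetError transitions the
// receiver toward DrainRequested, which remote processors observe on their
// next push.
type errSetter interface {
    SetError(err error)
}

func runImportFlow(
    resumerCtx context.Context, // canceled on job CANCEL / PAUSE
    recv errSetter,
    runFlow func(ctx context.Context), // stand-in for (*sql.DistSQLPlanner).Run
    drainTimeout time.Duration,
) {
    // The flow gets its own context rather than the Resumer's, so a job
    // cancellation no longer tears down dsp.Run before the processors exit.
    flowCtx, cancelFlow := context.WithCancel(context.Background())
    defer cancelFlow()

    flowDone := make(chan struct{})

    // Cancel watcher: on Resumer-context cancellation, ask the receiver to
    // drain; if the flow still hasn't wound down after drainTimeout, cancel
    // the flow context outright.
    go func() {
        select {
        case <-resumerCtx.Done():
            recv.SetError(resumerCtx.Err())
            select {
            case <-flowDone:
            case <-time.After(drainTimeout):
                cancelFlow()
            }
        case <-flowDone:
        }
    }()

    runFlow(flowCtx)
    close(flowDone)
}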

@stevendanna stevendanna requested review from a team as code owners November 9, 2022 17:33
@stevendanna stevendanna requested review from rhu713 and removed request for a team November 9, 2022 17:33

@cockroach-teamcity (Member) commented

This change is Reviewable

@stevendanna stevendanna changed the title jobs: clear job claim after execution backupccl: more controlled shutdown during job cancellation Nov 9, 2022

@stevendanna (Collaborator, Author) commented

First commit is #91563

@msbutler (Collaborator) left a comment

looks really good. Left mostly questions about docs.

pkg/sql/importer/import_processor_planning.go (resolved)
pkg/sql/importer/import_processor_planning.go (outdated, resolved)
pkg/sql/importer/import_processor_planning.go (resolved)
pkg/sql/importer/import_processor.go (outdated, resolved)
select {
case progCh <- prog:
case <-ctx.Done():
}

msbutler (Collaborator) commented:

We spoke about an idea where a remote processor would shut itself down if it can't push to the coordinator. That's not implemented here, correct?

stevendanna (Collaborator, Author) replied:

The import processor already attempts to push a progress meta periodically. As part of that periodic push, we should see the downstream status, so I don't think we actually need an additional push loop.

msbutler (Collaborator) replied:

But if the processor can't communicate with the coordinator, it will not receive a downstream status, i.e. a flowCtx cancellation, right? I thought we spoke about the need for the import processor to shut itself down if it has lost its connection with the coordinator.

stevendanna (Collaborator, Author) replied:

I'm not opposed to adding more if needed. But I think the following is how the code stands currently:

  1. Every 10 seconds, we push an update to the progress channel. That update will be available to the next caller of Next():

for prog := range idp.progCh {

g.GoCtx(func(ctx context.Context) error {
    // Push a progress update to the consumer every 10 seconds until the
    // work finishes (stopProgress) or the context is canceled.
    tick := time.NewTicker(time.Second * 10)
    defer tick.Stop()
    done := ctx.Done()
    for {
        select {
        case <-done:
            return ctx.Err()
        case <-stopProgress:
            return nil
        case <-tick.C:
            pushProgress()
        }
    }
})

  2. Next() is actually called in a loop in Run():

https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/execinfra/base.go#L186-L204

  3. In this loop, we Push() to our destination. I believe this will eventually make its way to the Outbox, which is what sends data to the remote node. If that Outbox encounters an error because it can't talk to the remote node, it should result in our processor seeing both a context cancellation and a call to ConsumerClosed.
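
For reference, here is a simplified paraphrase of that Run loop (error plumbing and newer consumer statuses elided; see the linked base.go for the authoritative version), which shows why a draining request only reaches the processor when it pushes a row or metadata:

// Simplified paraphrase of the Run loop linked above; not the real code.
package sketch

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/sql/execinfra"
)

func runLoop(ctx context.Context, src execinfra.RowSource, dst execinfra.RowReceiver) {
    for {
        row, meta := src.Next()
        switch dst.Push(row, meta) {
        case execinfra.NeedMoreRows:
            if row == nil && meta == nil {
                // Source exhausted: normal completion.
                dst.ProducerDone()
                return
            }
        case execinfra.DrainRequested:
            // The consumer (ultimately the coordinator's DistSQLReceiver)
            // asked us to drain: forward only trailing metadata, then stop.
            execinfra.DrainAndForwardMetadata(ctx, src, dst)
            dst.ProducerDone()
            return
        case execinfra.ConsumerClosed:
            // The consumer is gone: shut down without draining.
            src.ConsumerClosed()
            dst.ProducerDone()
            return
        }
    }
}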

msbutler (Collaborator) replied:

Thanks for this detailed explanation!! I took a closer look at the DistSQLReceiver.Push() function, and I think I've convinced myself that it will handle communication errors correctly, at least on master:

  • If the import processor attempts to send a row and the receiver's status is NeedMoreRows, we do seem to handle communication errors and update the status to draining/consumer closed.

https://github.com/yuzefovich/cockroach/blob/0c1095e31cf93ea7f177f8bf1750ebab188e02d7/pkg/sql/distsql_running.go#L1335-1339

  • When we push metadata, however, I don't think communication errors are handled. Maybe this isn't a problem, as we're only sending metadata after the receiver's status is no longer NeedMoreRows.

https://github.com/yuzefovich/cockroach/blob/0c1095e31cf93ea7f177f8bf1750ebab188e02d7/pkg/sql/distsql_running.go#L1176

I'd be curious to chat with Yahor about whether he thinks previous versions of CRDB handled comm errors like master does.

msbutler (Collaborator) commented:

Also, you mentioned this implementation only reduces the chances of a race. Do you have an understanding of what sorts of situations can slip past these guardrails?

msbutler (Collaborator) replied:

Spoke offline. Here's the scenario we're worried about:

t0: remote sends a progress update, with connection
t1: remote silently loses its connection with the coordinator
t2: remote sends an AddSSTable request, without realizing it has gone rogue
t3: coordinator sends a cancel request, waits for X seconds but gets crickets from the remote, and cancels the flowCtx
t4: remote finally receives the AddSSTable response, but the damage is done, because OnFailOrCancel has begun

We're unsure whether a timeout on the remote's AddSSTable request could solve this problem.

pkg/sql/importer/import_stmt_test.go (outdated, resolved)
@stevendanna stevendanna force-pushed the import-processor-cancel-cancel branch 2 times, most recently from b4c6d79 to 87cbbe1 Compare November 14, 2022 23:25
@stevendanna stevendanna changed the title backupccl: more controlled shutdown during job cancellation sql/importer: more controlled shutdown during job cancellation Nov 21, 2022

@msbutler (Collaborator) left a comment

Thanks for answering all my questions!

@stevendanna (Collaborator, Author) commented

@yuzefovich This is in part what we chatted about in Slack re processor shutdown. I wonder if you might have a few minutes to take a look.

@yuzefovich (Member) left a comment

Looks good to me and is already an improvement, but I do have one suggestion to consider.

Reviewed 2 of 4 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msbutler, @rhu713, and @stevendanna)


pkg/sql/importer/import_processor.go line 497 at r2 (raw file):

Previously, msbutler (Michael Butler) wrote…

Spoke offline. Here's the scenario we're worried about:

t0: remote sends a progress update, with connection
t1: remote silently loses its connection with the coordinator
t2: remote sends an AddSSTable request, without realizing it has gone rogue
t3: coordinator sends a cancel request, waits for X seconds but gets crickets from the remote, and cancels the flowCtx
t4: remote finally receives the AddSSTable response, but the damage is done, because OnFailOrCancel has begun

We're unsure whether a timeout on the remote's AddSSTable request could solve this problem.

There is probably some confusion about what "communication error" is. DistSQLReceiver.handleCommErr is called only when we have issues communicating the results to the client (i.e. SQL CLI, ORM, etc). Communication errors "within" a distributed plan are handled differently - they eventually are pushed as metadata into DistSQLReceiver.pushMeta where we call DistSQLReceiver.SetError which will update the status of the receiver. In a scenario where the coordinator node loses the connection to one of the remote nodes, a communication error will be "generated" on the coordinator node in processInboundStreamHelper and will be pushed as metadata into the DistSQLReceiver too.

Just to spell things out a bit: when a remote node pushes a progress metadata object, eventually it'll arrive on the coordinator node in processProducerMessage. If the status of the receiver has been changed to "drain requested" since the last update, then we send an explicit drain signal to the remote. Also, there haven't been any significant changes around this machinery for a while in case you're considering a backport.

The scenario you describe seems feasible. However, it would depend on exactly how "remote silently loses connection" - if the gRPC stream is broken, then the outbox listener goroutine would get an error in listenForDrainSignalFromConsumer which would trigger the ungraceful shutdown of the whole flow.

One thing we could consider doing is rather than transitioning to DrainRequested we'd go straight to ConsumerClosed. DrainRequested is propagated passively, on the next push of the metadata by the remote nodes, but ConsumerClosed results in an eager propagation - we abruptly shutdown gRPC streams between nodes. This then results in the outboxes on the remote nodes canceling flowCtx on each node. As a whole, this would be an ungraceful termination, so we might need to teach the import processors to clean up in such cases, but it would be eager.
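
On the last point, a hypothetical sketch (not part of this diff) of the goroutine management the PR description mentions: run the processor's helper goroutines in a ctxgroup tied to a cancelable context and wait on the group in the close path, so nothing outlives the processor whether the flow drains gracefully or is terminated ungracefully.

// Hypothetical sketch; the type and method names are illustrative, only the
// ctxgroup usage mirrors the pattern referenced in the PR description.
package sketch

import (
    "context"

    "github.com/cockroachdb/cockroach/pkg/util/ctxgroup"
)

type managedProcessor struct {
    group      ctxgroup.Group
    cancelWork context.CancelFunc
}

func (p *managedProcessor) start(ctx context.Context) {
    workCtx, cancel := context.WithCancel(ctx)
    p.cancelWork = cancel
    p.group = ctxgroup.WithContext(workCtx)
    p.group.GoCtx(func(ctx context.Context) error {
        // Progress pushes, ingestion work, etc. Everything must respect ctx
        // so that close() below cannot hang.
        <-ctx.Done()
        return ctx.Err()
    })
}

// close is invoked from the processor's shutdown paths (graceful drain or
// flow-context cancellation alike).
func (p *managedProcessor) close() {
    p.cancelWork()
    _ = p.group.Wait() // no goroutine outlives the processor
}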


pkg/sql/importer/import_processor_planning.go line 339 at r3 (raw file):

// watch starts watching the context passed to newCancelWatcher for
// cancelation and notifies the given DistSQLReceiver when a

nit: s/cancelation/cancellation/.


pkg/sql/importer/import_stmt_test.go line 2083 at r3 (raw file):

// TestImportIntoCSVCancel cancels a distributed import. This test
// currently has few assertions but is essentially a regression tests

nit: s/tests/test/.

@stevendanna stevendanna force-pushed the import-processor-cancel-cancel branch from 4946de7 to 524e674 Compare December 22, 2022 20:15

@stevendanna (Collaborator, Author) commented

A good portion of this was merged in another PR. What remains now runs into problems with other recent distsql changes: calling SetError hits a data race when updating the status.

What is also interesting is that the test that previously required these changes no longer reliably catches the problem.

@yuzefovich (Member) left a comment

Hm, I did recently fix an issue around SetError in #93360 - I'm assuming you've rebased on top of that, right? Can you share reproduction steps / a CI link for the failure?

We also merged #90864 yesterday, which should improve the shutdown of distributed plans that use the row-by-row infrastructure, so it might be interesting to try reverting that change and see whether the problem still reproduces.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msbutler, @rhu713, and @stevendanna)

@rafiss rafiss removed the request for review from a team January 4, 2023 21:02