rowexec: add test to verify uncertainty errors are properly returned #51375

asubiotto · 2020-07-13T14:23:34Z

This PR fixes a bunch of instances that would swallow uncertainty errors in the vectorized engine and adds a test for this. This also uncovered a query termination edge case in the vectorized Inbox we weren't handling properly.

Release note (bug fix): fix edge cases where the vectorized execution engine would drop some results under contention

cockroach-teamcity · 2020-07-13T14:23:44Z

This change is

asubiotto · 2020-07-13T14:36:23Z

BTW I set _, err = db.Exec("SET CLUSTER SETTING sql.defaults.results_buffer.size = '0'") which I would expect would disable retries.

andreimatei

This is currently a WIP. The test is timing out under stress with the row execution engine and I'm looking for feedback on what I'm doing wrong. @andreimatei it looks like the query is retried indefinitely:

returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.533202000,0 encountered previous write with future timestamp 1594649777.533202001,0 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.534206000,1 encountered previous write with future timestamp 1594649777.534206001,1 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {2 1594649777.534797000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.534206001,2 encountered previous write with future timestamp 1594649777.534206002,2 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {2 1594649777.534797000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
returning rwue error for key /Table/53/1/20 :0:
<...>

Is there a way to turn off these retries?

(haven't looked at the patch)
Automatic retries are disabled once results are returned to the client - which happens when rows are produced after the buffer size is exhausted. An error encountered before the first row in a txn is returned is always retried. In particular, making this guarantee is the point of specific effort to swallow RWUI: a limit 1 query in an implicit txn is always retried automatically.
If you want to disable auto-retries for a txn, you can run a dummy query in the txn's beginning.
Does this help?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

rohany · 2020-07-13T15:56:17Z

We should also add a test for cockroachlabs/support#513 cc @rohany

Sure, once we know what's actually going on there.

asubiotto · 2020-07-13T15:59:02Z

Thanks, that's probably the issue. Will try that and update.

Sure, once we know what's actually going on there.

If we have the plan/query, once this test is stable, it might be easier to add that query to this test to reproduce deterministically and make it easier to track down the root cause.

yuzefovich

Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)

pkg/sql/colexec/routers.go, line 109 at r1 (raw file):

		syncutil.Mutex
		state routerOutputOpState
		// forwardedErr is an error that was forwarded by the HashRouter. It set,

nit: s/It/If/.

By "always returns" do you mean that it will be returning this error on all calls to Next once the error is set? It might be worth spelling this out.

pkg/sql/colexec/routers.go, line 352 at r1 (raw file):

func (o *routerOutputOp) forwardErrLocked(err error) {
	if err != nil {

Is it possible that o.mu.forwardedErr is already non-nil here?

pkg/sql/colexec/routers_test.go, line 715 at r1 (raw file):

	r := newHashRouterWithOutputs(in, typs, []uint32{0}, unbufferedCh, routerOutputs, nil /* toDrain */, nil /* toClose */)

	verifyAtLeastOneOutputGotCancellationError := func(t *testing.T) {

super nit: maybe s/verify/assert/.

pkg/sql/colexec/routers_test.go, line 719 at r1 (raw file):

		for _, output := range outputs {
			if errors.Is(output.forwardedErr, context.Canceled) {
				return

I think we're actually missing a check that at least one output got the error: the way I see it right now, it is possible that forwardedErr of all outputs is nil, and we don't fail the test.

pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):

				return nil, nil, nil, err
			}
			inbox, err := s.remoteComponentCreator.newInbox(ctx, colmem.NewAllocator(ctx, s.newStreamingMemAccount(flowCtx), factory), input.ColumnTypes, inputStream.StreamID)

nit: this line is now longer than 100 characters.

pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 210 at r2 (raw file):

					inboxMemAccount := testMemMonitor.MakeBoundAccount()
					defer inboxMemAccount.Close(ctxLocal)
					inbox, err := colrpc.NewInbox(context.Background(), colmem.NewAllocator(ctxLocal, &inboxMemAccount, testColumnFactory), typs, execinfrapb.StreamID(streamID))

nit: it'd probably be nice to have a separate ctxFlow variable or something.

Also the line is too long.

pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 300 at r2 (raw file):

					inboxMemAccount := testMemMonitor.MakeBoundAccount()
					defer inboxMemAccount.Close(ctxAnotherRemote)
					inbox, err := colrpc.NewInbox(context.Background(), colmem.NewAllocator(ctxAnotherRemote, &inboxMemAccount, testColumnFactory), typs, execinfrapb.StreamID(streamID))

ditto

pkg/sql/colflow/colrpc/inbox.go, line 203 at r2 (raw file):

		return fmt.Errorf("%s: streamCtx while waiting for reader (remote client canceled)", streamCtx.Err())
	case <-i.flowCtx.Done():
		return fmt.Errorf("%s: flowCtx while waiting for reader (remote client canceled)", i.flowCtx.Err())

I don't thinks it's "remote client" canceled, right?

asubiotto

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)

pkg/sql/colexec/routers.go, line 352 at r1 (raw file):

Previously, yuzefovich wrote…

Is it possible that o.mu.forwardedErr is already non-nil here?

Yes, although not in practice. I think it's fine to overwrite the error in this case, added a comment. What do you think?

pkg/sql/colexec/routers_test.go, line 715 at r1 (raw file):

Previously, yuzefovich wrote…

super nit: maybe s/verify/assert/.

Done.

pkg/sql/colexec/routers_test.go, line 719 at r1 (raw file):

Previously, yuzefovich wrote…

I think we're actually missing a check that at least one output got the error: the way I see it right now, it is possible that forwardedErr of all outputs is nil, and we don't fail the test.

Done. It actually doesn't make sense to verify this in this test (since we're using a mock output and we'd just be testing testing code). This behavior is tested in the randomized test.

pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):

Previously, yuzefovich wrote…

nit: this line is now longer than 100 characters.

Shouldn't a linter would catch this? Done.

pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 210 at r2 (raw file):

Previously, yuzefovich wrote…

nit: it'd probably be nice to have a separate ctxFlow variable or something.

Also the line is too long.

Changed to use the test flow contexts which I think is the actual thing.

pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 300 at r2 (raw file):

Previously, yuzefovich wrote…

ditto

Done.

pkg/sql/colflow/colrpc/inbox.go, line 203 at r2 (raw file):

Previously, yuzefovich wrote…

I don't thinks it's "remote client" canceled, right?

Done.

asubiotto · 2020-07-14T09:36:20Z

This is no longer a WIP. Will look into adding a test case for https://github.com/cockroachlabs/support/issues/513

rohany · 2020-07-14T14:24:25Z

Let me know if you need help reproing / explaining what I've found

yuzefovich

Reviewed 5 of 11 files at r4, 7 of 7 files at r5, 2 of 2 files at r6, 1 of 1 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)

pkg/sql/colexec/routers.go, line 352 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Yes, although not in practice. I think it's fine to overwrite the error in this case, added a comment. What do you think?

SGTM.

pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Shouldn't a linter would catch this? Done.

I don't think we have a linter for that, I have set up Goland to show lines at 80 and 100 characters as our style guide says that we should aim for no more than 80 characters per line, and 100 characters is a pretty hard limit.

pkg/sql/rowexec/processors_test.go, line 714 at r7 (raw file):

	)

	// Best thing to do in this case I think is to

nit: incomplete comment.

pkg/sql/rowexec/processors_test.go, line 783 at r7 (raw file):

		// overrideErrorOrigin, if set, enables the request filter on the node
		// indices provided.
		overrideErrorOrigin func() int

Do you intend to add more test cases that actually use overrideErrorOrigin and overrideConn knobs?

The vectorized HashRouter would previously buffer any error that happened during execution and return zero-batches so that this error would be drained through DrainMeta. This would incorrectly swallow an uncertainty error in some cases. This commit returns an error immediately during execution. Release note: None

There are cases where Inbox.Next is not called by downstream components (e.g. left side of HashJoiner). This could cause an inbound stream to be indefinitely stuck waiting for Next to be called (and the context to be passed to that goroutine). This commit captures the flowCtx and listens for its cancellation in these cases. The proper way to fix this would be to change colexecbase.Operator.Init to take a context, since it must be called unconditionally. This commit is a temporary solution to fix the immediate issue, but a future commit will introduce this refactor. Release note (bug fix): some queries in the vectorized execution engine could previously hang during cleanup, this is now fixed.

The Columnarizer would previously buffer errors to return while draining, which could result in swallowed uncertainty errors. This commit changes the Columnarizer to immediately propagate such errors. Release note: None

asubiotto

Added a skipped test that repros #51458. There still seems to be an issue where we're overwriting an uncertainty error with context canceled somewhere. I'll look into this but I'm also happy to get this in with a small change to not fail in this case before I leave and investigate when I'm back.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)

pkg/sql/rowexec/processors_test.go, line 783 at r7 (raw file):

Previously, yuzefovich wrote…

Do you intend to add more test cases that actually use overrideErrorOrigin and overrideConn knobs?

Removed overrideConn, but I'm using overrideErrorOrigin in a new test I added that simulates #51458

Note that `vectorize=true` config is temporarily skipped because there appears to be a race between the expected uncertainty error and a context cancellation error. Release note: None

yuzefovich · 2020-07-16T18:30:44Z

I took a quick glance at why we're seeing context cancellation error instead of the expected RWUI error, and nothing jumped out to me, so I added a skip for vectorize=true config.

bors r=yuzefovich

yuzefovich · 2020-07-16T20:02:48Z

bors crashed

bors r=yuzefovich

craig · 2020-07-16T20:54:15Z

Build failed (retrying...)

GitHub CI (Cockroach)

craig · 2020-07-16T22:35:50Z

Build failed (retrying...)

GitHub CI (Cockroach)

craig · 2020-07-17T00:19:45Z

Build succeeded

GitHub CI (Cockroach)

asubiotto requested review from andreimatei, yuzefovich and a team July 13, 2020 14:23

andreimatei reviewed Jul 13, 2020

View reviewed changes

yuzefovich reviewed Jul 13, 2020

View reviewed changes

asubiotto changed the title ~~[WIP] rowexec: add test to verify uncertainty errors are properly returned~~ rowexec: add test to verify uncertainty errors are properly returned Jul 14, 2020

asubiotto commented Jul 14, 2020

View reviewed changes

yuzefovich reviewed Jul 14, 2020

View reviewed changes

asubiotto mentioned this pull request Jul 15, 2020

execinfra: uncertainty error can be incorrectly swallowed during a distributed join stage #51458

Closed

asubiotto added 3 commits July 15, 2020 16:14

colexec: make Columnarizer return error immediately

213f212

The Columnarizer would previously buffer errors to return while draining, which could result in swallowed uncertainty errors. This commit changes the Columnarizer to immediately propagate such errors. Release note: None

asubiotto commented Jul 15, 2020

View reviewed changes

rowexec: add test to verify uncertainty errors are properly returned

4ab55c5

Note that `vectorize=true` config is temporarily skipped because there appears to be a race between the expected uncertainty error and a context cancellation error. Release note: None

craig bot merged commit af0031e into cockroachdb:master Jul 17, 2020

yuzefovich mentioned this pull request Jul 17, 2020

rowflow,colexec: make routers propagate errors to all non-closed outputs #51518

Merged

irfansharif mentioned this pull request Jul 21, 2020

roachtest: scaledata/filesystem_simulator/nodes=6 failed #51215

Closed

asubiotto deleted the rwue branch August 3, 2020 12:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rowexec: add test to verify uncertainty errors are properly returned #51375

rowexec: add test to verify uncertainty errors are properly returned #51375

asubiotto commented Jul 13, 2020 •

edited

Loading

cockroach-teamcity commented Jul 13, 2020

asubiotto commented Jul 13, 2020

andreimatei left a comment •

edited by asubiotto

Loading

rohany commented Jul 13, 2020

asubiotto commented Jul 13, 2020

yuzefovich left a comment

asubiotto left a comment

asubiotto commented Jul 14, 2020

rohany commented Jul 14, 2020

yuzefovich left a comment

asubiotto left a comment

yuzefovich commented Jul 16, 2020

yuzefovich commented Jul 16, 2020

craig bot commented Jul 16, 2020

craig bot commented Jul 16, 2020

craig bot commented Jul 17, 2020

rowexec: add test to verify uncertainty errors are properly returned #51375

rowexec: add test to verify uncertainty errors are properly returned #51375

Conversation

asubiotto commented Jul 13, 2020 • edited Loading

cockroach-teamcity commented Jul 13, 2020

asubiotto commented Jul 13, 2020

andreimatei left a comment • edited by asubiotto Loading

Choose a reason for hiding this comment

rohany commented Jul 13, 2020

asubiotto commented Jul 13, 2020

yuzefovich left a comment

Choose a reason for hiding this comment

asubiotto left a comment

Choose a reason for hiding this comment

asubiotto commented Jul 14, 2020

rohany commented Jul 14, 2020

yuzefovich left a comment

Choose a reason for hiding this comment

asubiotto left a comment

Choose a reason for hiding this comment

yuzefovich commented Jul 16, 2020

yuzefovich commented Jul 16, 2020

craig bot commented Jul 16, 2020

Build failed (retrying...)

craig bot commented Jul 16, 2020

Build failed (retrying...)

craig bot commented Jul 17, 2020

Build succeeded

asubiotto commented Jul 13, 2020 •

edited

Loading

andreimatei left a comment •

edited by asubiotto

Loading