Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rowexec: add test to verify uncertainty errors are properly returned #51375

Merged
merged 4 commits into from
Jul 17, 2020
Merged

rowexec: add test to verify uncertainty errors are properly returned #51375

merged 4 commits into from
Jul 17, 2020

Conversation

asubiotto
Copy link
Contributor

@asubiotto asubiotto commented Jul 13, 2020

This PR fixes a bunch of instances that would swallow uncertainty errors in the vectorized engine and adds a test for this. This also uncovered a query termination edge case in the vectorized Inbox we weren't handling properly.

Release note (bug fix): fix edge cases where the vectorized execution engine would drop some results under contention

@asubiotto asubiotto requested review from andreimatei, yuzefovich and a team July 13, 2020 14:23
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@asubiotto
Copy link
Contributor Author

BTW I set _, err = db.Exec("SET CLUSTER SETTING sql.defaults.results_buffer.size = '0'") which I would expect would disable retries.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently a WIP. The test is timing out under stress with the row execution engine and I'm looking for feedback on what I'm doing wrong. @andreimatei it looks like the query is retried indefinitely:

returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.533202000,0 encountered previous write with future timestamp 1594649777.533202001,0 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.534206000,1 encountered previous write with future timestamp 1594649777.534206001,1 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {2 1594649777.534797000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.534206001,2 encountered previous write with future timestamp 1594649777.534206002,2 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {2 1594649777.534797000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
returning rwue error for key /Table/53/1/20 :0:
<...>

Is there a way to turn off these retries?

(haven't looked at the patch)
Automatic retries are disabled once results are returned to the client - which happens when rows are produced after the buffer size is exhausted. An error encountered before the first row in a txn is returned is always retried. In particular, making this guarantee is the point of specific effort to swallow RWUI: a limit 1 query in an implicit txn is always retried automatically.
If you want to disable auto-retries for a txn, you can run a dummy query in the txn's beginning.
Does this help?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)

@rohany
Copy link
Contributor

rohany commented Jul 13, 2020

We should also add a test for cockroachlabs/support#513 cc @rohany

Sure, once we know what's actually going on there.

@asubiotto
Copy link
Contributor Author

Thanks, that's probably the issue. Will try that and update.

Sure, once we know what's actually going on there.

If we have the plan/query, once this test is stable, it might be easier to add that query to this test to reproduce deterministically and make it easier to track down the root cause.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)


pkg/sql/colexec/routers.go, line 109 at r1 (raw file):

		syncutil.Mutex
		state routerOutputOpState
		// forwardedErr is an error that was forwarded by the HashRouter. It set,

nit: s/It/If/.

By "always returns" do you mean that it will be returning this error on all calls to Next once the error is set? It might be worth spelling this out.


pkg/sql/colexec/routers.go, line 352 at r1 (raw file):

func (o *routerOutputOp) forwardErrLocked(err error) {
	if err != nil {

Is it possible that o.mu.forwardedErr is already non-nil here?


pkg/sql/colexec/routers_test.go, line 715 at r1 (raw file):

	r := newHashRouterWithOutputs(in, typs, []uint32{0}, unbufferedCh, routerOutputs, nil /* toDrain */, nil /* toClose */)

	verifyAtLeastOneOutputGotCancellationError := func(t *testing.T) {

super nit: maybe s/verify/assert/.


pkg/sql/colexec/routers_test.go, line 719 at r1 (raw file):

		for _, output := range outputs {
			if errors.Is(output.forwardedErr, context.Canceled) {
				return

I think we're actually missing a check that at least one output got the error: the way I see it right now, it is possible that forwardedErr of all outputs is nil, and we don't fail the test.


pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):

				return nil, nil, nil, err
			}
			inbox, err := s.remoteComponentCreator.newInbox(ctx, colmem.NewAllocator(ctx, s.newStreamingMemAccount(flowCtx), factory), input.ColumnTypes, inputStream.StreamID)

nit: this line is now longer than 100 characters.


pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 210 at r2 (raw file):

					inboxMemAccount := testMemMonitor.MakeBoundAccount()
					defer inboxMemAccount.Close(ctxLocal)
					inbox, err := colrpc.NewInbox(context.Background(), colmem.NewAllocator(ctxLocal, &inboxMemAccount, testColumnFactory), typs, execinfrapb.StreamID(streamID))

nit: it'd probably be nice to have a separate ctxFlow variable or something.

Also the line is too long.


pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 300 at r2 (raw file):

					inboxMemAccount := testMemMonitor.MakeBoundAccount()
					defer inboxMemAccount.Close(ctxAnotherRemote)
					inbox, err := colrpc.NewInbox(context.Background(), colmem.NewAllocator(ctxAnotherRemote, &inboxMemAccount, testColumnFactory), typs, execinfrapb.StreamID(streamID))

ditto


pkg/sql/colflow/colrpc/inbox.go, line 203 at r2 (raw file):

		return fmt.Errorf("%s: streamCtx while waiting for reader (remote client canceled)", streamCtx.Err())
	case <-i.flowCtx.Done():
		return fmt.Errorf("%s: flowCtx while waiting for reader (remote client canceled)", i.flowCtx.Err())

I don't thinks it's "remote client" canceled, right?

@asubiotto asubiotto changed the title [WIP] rowexec: add test to verify uncertainty errors are properly returned rowexec: add test to verify uncertainty errors are properly returned Jul 14, 2020
Copy link
Contributor Author

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/colexec/routers.go, line 352 at r1 (raw file):

Previously, yuzefovich wrote…

Is it possible that o.mu.forwardedErr is already non-nil here?

Yes, although not in practice. I think it's fine to overwrite the error in this case, added a comment. What do you think?


pkg/sql/colexec/routers_test.go, line 715 at r1 (raw file):

Previously, yuzefovich wrote…

super nit: maybe s/verify/assert/.

Done.


pkg/sql/colexec/routers_test.go, line 719 at r1 (raw file):

Previously, yuzefovich wrote…

I think we're actually missing a check that at least one output got the error: the way I see it right now, it is possible that forwardedErr of all outputs is nil, and we don't fail the test.

Done. It actually doesn't make sense to verify this in this test (since we're using a mock output and we'd just be testing testing code). This behavior is tested in the randomized test.


pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):

Previously, yuzefovich wrote…

nit: this line is now longer than 100 characters.

Shouldn't a linter would catch this? Done.


pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 210 at r2 (raw file):

Previously, yuzefovich wrote…

nit: it'd probably be nice to have a separate ctxFlow variable or something.

Also the line is too long.

Changed to use the test flow contexts which I think is the actual thing.


pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 300 at r2 (raw file):

Previously, yuzefovich wrote…

ditto

Done.


pkg/sql/colflow/colrpc/inbox.go, line 203 at r2 (raw file):

Previously, yuzefovich wrote…

I don't thinks it's "remote client" canceled, right?

Done.

@asubiotto
Copy link
Contributor Author

This is no longer a WIP. Will look into adding a test case for https://github.com/cockroachlabs/support/issues/513

@rohany
Copy link
Contributor

rohany commented Jul 14, 2020

Let me know if you need help reproing / explaining what I've found

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 11 files at r4, 7 of 7 files at r5, 2 of 2 files at r6, 1 of 1 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)


pkg/sql/colexec/routers.go, line 352 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Yes, although not in practice. I think it's fine to overwrite the error in this case, added a comment. What do you think?

SGTM.


pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Shouldn't a linter would catch this? Done.

I don't think we have a linter for that, I have set up Goland to show lines at 80 and 100 characters as our style guide says that we should aim for no more than 80 characters per line, and 100 characters is a pretty hard limit.


pkg/sql/rowexec/processors_test.go, line 714 at r7 (raw file):

	)

	// Best thing to do in this case I think is to

nit: incomplete comment.


pkg/sql/rowexec/processors_test.go, line 783 at r7 (raw file):

		// overrideErrorOrigin, if set, enables the request filter on the node
		// indices provided.
		overrideErrorOrigin func() int

Do you intend to add more test cases that actually use overrideErrorOrigin and overrideConn knobs?

The vectorized HashRouter would previously buffer any error that happened
during execution and return zero-batches so that this error would be drained
through DrainMeta. This would incorrectly swallow an uncertainty error in some
cases. This commit returns an error immediately during execution.

Release note: None
There are cases where Inbox.Next is not called by downstream components
(e.g. left side of HashJoiner). This could cause an inbound stream to be
indefinitely stuck waiting for Next to be called (and the context to be passed
to that goroutine). This commit captures the flowCtx and listens for its
cancellation in these cases.

The proper way to fix this would be to change colexecbase.Operator.Init to take
a context, since it must be called unconditionally. This commit is a temporary
solution to fix the immediate issue, but a future commit will introduce this
refactor.

Release note (bug fix): some queries in the vectorized execution engine could
previously hang during cleanup, this is now fixed.
The Columnarizer would previously buffer errors to return while draining, which
could result in swallowed uncertainty errors. This commit changes the
Columnarizer to immediately propagate such errors.

Release note: None
Copy link
Contributor Author

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a skipped test that repros #51458. There still seems to be an issue where we're overwriting an uncertainty error with context canceled somewhere. I'll look into this but I'm also happy to get this in with a small change to not fail in this case before I leave and investigate when I'm back.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)


pkg/sql/rowexec/processors_test.go, line 783 at r7 (raw file):

Previously, yuzefovich wrote…

Do you intend to add more test cases that actually use overrideErrorOrigin and overrideConn knobs?

Removed overrideConn, but I'm using overrideErrorOrigin in a new test I added that simulates #51458

Note that `vectorize=true` config is temporarily skipped because there
appears to be a race between the expected uncertainty error and
a context cancellation error.

Release note: None
@yuzefovich
Copy link
Member

I took a quick glance at why we're seeing context cancellation error instead of the expected RWUI error, and nothing jumped out to me, so I added a skip for vectorize=true config.

bors r=yuzefovich

@yuzefovich
Copy link
Member

bors crashed

bors r=yuzefovich

@craig
Copy link
Contributor

craig bot commented Jul 16, 2020

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Jul 16, 2020

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Jul 17, 2020

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants