-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rowexec: add test to verify uncertainty errors are properly returned #51375
Conversation
BTW I set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently a WIP. The test is timing out under stress with the row execution engine and I'm looking for feedback on what I'm doing wrong. @andreimatei it looks like the query is retried indefinitely:
returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.533202000,0 encountered previous write with future timestamp 1594649777.533202001,0 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.534206000,1 encountered previous write with future timestamp 1594649777.534206001,1 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {2 1594649777.534797000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
returning rwue error for key /Table/53/1/20 :0:
distsql receiver got error ReadWithinUncertaintyIntervalError: read at time 1594649777.534206001,2 encountered previous write with future timestamp 1594649777.534206002,2 within uncertainty interval `t <= 1594649778.033202000,0`; observed timestamps: [{1 1594649777.533202000,0} {2 1594649777.534797000,0} {3 1594649777.534206000,1}]
returning rwue error for key /Table/53/1/20 :0:
returning rwue error for key /Table/53/1/20 :0:
<...>
Is there a way to turn off these retries?
(haven't looked at the patch)
Automatic retries are disabled once results are returned to the client - which happens when rows are produced after the buffer size is exhausted. An error encountered before the first row in a txn is returned is always retried. In particular, making this guarantee is the point of specific effort to swallow RWUI: a limit 1
query in an implicit txn is always retried automatically.
If you want to disable auto-retries for a txn, you can run a dummy query in the txn's beginning.
Does this help?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @yuzefovich)
Sure, once we know what's actually going on there. |
Thanks, that's probably the issue. Will try that and update.
If we have the plan/query, once this test is stable, it might be easier to add that query to this test to reproduce deterministically and make it easier to track down the root cause. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)
pkg/sql/colexec/routers.go, line 109 at r1 (raw file):
syncutil.Mutex state routerOutputOpState // forwardedErr is an error that was forwarded by the HashRouter. It set,
nit: s/It/If/
.
By "always returns" do you mean that it will be returning this error on all calls to Next
once the error is set? It might be worth spelling this out.
pkg/sql/colexec/routers.go, line 352 at r1 (raw file):
func (o *routerOutputOp) forwardErrLocked(err error) { if err != nil {
Is it possible that o.mu.forwardedErr
is already non-nil here?
pkg/sql/colexec/routers_test.go, line 715 at r1 (raw file):
r := newHashRouterWithOutputs(in, typs, []uint32{0}, unbufferedCh, routerOutputs, nil /* toDrain */, nil /* toClose */) verifyAtLeastOneOutputGotCancellationError := func(t *testing.T) {
super nit: maybe s/verify/assert/
.
pkg/sql/colexec/routers_test.go, line 719 at r1 (raw file):
for _, output := range outputs { if errors.Is(output.forwardedErr, context.Canceled) { return
I think we're actually missing a check that at least one output got the error: the way I see it right now, it is possible that forwardedErr
of all outputs is nil
, and we don't fail the test.
pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):
return nil, nil, nil, err } inbox, err := s.remoteComponentCreator.newInbox(ctx, colmem.NewAllocator(ctx, s.newStreamingMemAccount(flowCtx), factory), input.ColumnTypes, inputStream.StreamID)
nit: this line is now longer than 100 characters.
pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 210 at r2 (raw file):
inboxMemAccount := testMemMonitor.MakeBoundAccount() defer inboxMemAccount.Close(ctxLocal) inbox, err := colrpc.NewInbox(context.Background(), colmem.NewAllocator(ctxLocal, &inboxMemAccount, testColumnFactory), typs, execinfrapb.StreamID(streamID))
nit: it'd probably be nice to have a separate ctxFlow
variable or something.
Also the line is too long.
pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 300 at r2 (raw file):
inboxMemAccount := testMemMonitor.MakeBoundAccount() defer inboxMemAccount.Close(ctxAnotherRemote) inbox, err := colrpc.NewInbox(context.Background(), colmem.NewAllocator(ctxAnotherRemote, &inboxMemAccount, testColumnFactory), typs, execinfrapb.StreamID(streamID))
ditto
pkg/sql/colflow/colrpc/inbox.go, line 203 at r2 (raw file):
return fmt.Errorf("%s: streamCtx while waiting for reader (remote client canceled)", streamCtx.Err()) case <-i.flowCtx.Done(): return fmt.Errorf("%s: flowCtx while waiting for reader (remote client canceled)", i.flowCtx.Err())
I don't thinks it's "remote client" canceled, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/colexec/routers.go, line 352 at r1 (raw file):
Previously, yuzefovich wrote…
Is it possible that
o.mu.forwardedErr
is already non-nil here?
Yes, although not in practice. I think it's fine to overwrite the error in this case, added a comment. What do you think?
pkg/sql/colexec/routers_test.go, line 715 at r1 (raw file):
Previously, yuzefovich wrote…
super nit: maybe
s/verify/assert/
.
Done.
pkg/sql/colexec/routers_test.go, line 719 at r1 (raw file):
Previously, yuzefovich wrote…
I think we're actually missing a check that at least one output got the error: the way I see it right now, it is possible that
forwardedErr
of all outputs isnil
, and we don't fail the test.
Done. It actually doesn't make sense to verify this in this test (since we're using a mock output and we'd just be testing testing code). This behavior is tested in the randomized test.
pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):
Previously, yuzefovich wrote…
nit: this line is now longer than 100 characters.
Shouldn't a linter would catch this? Done.
pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 210 at r2 (raw file):
Previously, yuzefovich wrote…
nit: it'd probably be nice to have a separate
ctxFlow
variable or something.Also the line is too long.
Changed to use the test flow contexts which I think is the actual thing.
pkg/sql/colflow/vectorized_flow_shutdown_test.go, line 300 at r2 (raw file):
Previously, yuzefovich wrote…
ditto
Done.
pkg/sql/colflow/colrpc/inbox.go, line 203 at r2 (raw file):
Previously, yuzefovich wrote…
I don't thinks it's "remote client" canceled, right?
Done.
This is no longer a WIP. Will look into adding a test case for https://github.com/cockroachlabs/support/issues/513 |
Let me know if you need help reproing / explaining what I've found |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 11 files at r4, 7 of 7 files at r5, 2 of 2 files at r6, 1 of 1 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)
pkg/sql/colexec/routers.go, line 352 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Yes, although not in practice. I think it's fine to overwrite the error in this case, added a comment. What do you think?
SGTM.
pkg/sql/colflow/vectorized_flow.go, line 733 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Shouldn't a linter would catch this? Done.
I don't think we have a linter for that, I have set up Goland to show lines at 80 and 100 characters as our style guide says that we should aim for no more than 80 characters per line, and 100 characters is a pretty hard limit.
pkg/sql/rowexec/processors_test.go, line 714 at r7 (raw file):
) // Best thing to do in this case I think is to
nit: incomplete comment.
pkg/sql/rowexec/processors_test.go, line 783 at r7 (raw file):
// overrideErrorOrigin, if set, enables the request filter on the node // indices provided. overrideErrorOrigin func() int
Do you intend to add more test cases that actually use overrideErrorOrigin
and overrideConn
knobs?
The vectorized HashRouter would previously buffer any error that happened during execution and return zero-batches so that this error would be drained through DrainMeta. This would incorrectly swallow an uncertainty error in some cases. This commit returns an error immediately during execution. Release note: None
There are cases where Inbox.Next is not called by downstream components (e.g. left side of HashJoiner). This could cause an inbound stream to be indefinitely stuck waiting for Next to be called (and the context to be passed to that goroutine). This commit captures the flowCtx and listens for its cancellation in these cases. The proper way to fix this would be to change colexecbase.Operator.Init to take a context, since it must be called unconditionally. This commit is a temporary solution to fix the immediate issue, but a future commit will introduce this refactor. Release note (bug fix): some queries in the vectorized execution engine could previously hang during cleanup, this is now fixed.
The Columnarizer would previously buffer errors to return while draining, which could result in swallowed uncertainty errors. This commit changes the Columnarizer to immediately propagate such errors. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a skipped test that repros #51458. There still seems to be an issue where we're overwriting an uncertainty error with context canceled somewhere. I'll look into this but I'm also happy to get this in with a small change to not fail in this case before I leave and investigate when I'm back.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)
pkg/sql/rowexec/processors_test.go, line 783 at r7 (raw file):
Previously, yuzefovich wrote…
Do you intend to add more test cases that actually use
overrideErrorOrigin
andoverrideConn
knobs?
Removed overrideConn
, but I'm using overrideErrorOrigin
in a new test I added that simulates #51458
Note that `vectorize=true` config is temporarily skipped because there appears to be a race between the expected uncertainty error and a context cancellation error. Release note: None
I took a quick glance at why we're seeing context cancellation error instead of the expected RWUI error, and nothing jumped out to me, so I added a skip for bors r=yuzefovich |
bors crashed bors r=yuzefovich |
Build failed (retrying...) |
Build failed (retrying...) |
Build succeeded |
This PR fixes a bunch of instances that would swallow uncertainty errors in the vectorized engine and adds a test for this. This also uncovered a query termination edge case in the vectorized Inbox we weren't handling properly.
Release note (bug fix): fix edge cases where the vectorized execution engine would drop some results under contention