opt: fix panic recovery for error handling #38570

RaduBerinde · 2019-06-29T03:26:54Z

The major entry points in the optimizer catch all panics that throw an
error and convert them to errors. Unfortunately, this also catches
runtime errors (in which case we convert them to errors and lose the
stack trace).

This change adds a ShouldCatch helper which determines if we should
return a thrown object as an error. If the object is a
runtime.Error, it gets wrapped by an AssertionFailed error which
will cause correct error handling (stack trace, Sentry reporting, etc).
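
A minimal sketch of the idea behind such a helper, using only the standard library. The names `shouldCatch`, `assertionError`, `build`, and `errUnsupported` are hypothetical stand-ins for illustration, not the code in this PR; the real wrapper comes from the errors library.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// errUnsupported is a hypothetical error that optimizer code might throw.
var errUnsupported = errors.New("unsupported syntax")

// assertionError is a hypothetical stand-in for the AssertionFailed wrapper.
type assertionError struct{ cause error }

func (e *assertionError) Error() string { return "assertion failed: " + e.cause.Error() }
func (e *assertionError) Unwrap() error { return e.cause }

// shouldCatch reports whether a recovered panic value should be returned as
// an error. Non-error values are not caught; runtime errors are wrapped.
func shouldCatch(obj interface{}) (bool, error) {
	err, ok := obj.(error)
	if !ok {
		return false, nil
	}
	if _, isRuntime := err.(runtime.Error); isRuntime {
		err = &assertionError{cause: err}
	}
	return true, err
}

// build simulates an entry point: it runs fn and converts panics that carry
// an error into a returned error, re-panicking on anything else.
func build(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			ok, e := shouldCatch(r)
			if !ok {
				panic(r)
			}
			err = e
		}
	}()
	fn()
	return nil
}

func main() {
	var s []int
	idx := 3
	var ae *assertionError

	// An out-of-bounds access is a runtime.Error, so it gets wrapped.
	err := build(func() { _ = s[idx] })
	fmt.Println(err, errors.As(err, &ae))

	// An ordinary thrown error passes through unchanged.
	err = build(func() { panic(errUnsupported) })
	fmt.Println(err, err == errUnsupported)
}
```

Keeping the decision in one helper means every entry point treats runtime errors the same way.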

As part of this change, we are also removing wrappers like
builderError, which are no longer useful. We fix the opt tester to
fail with the full error information (using %+v) for assertion
errors.

Release note: None

knz

So I think this PR is misguided. When I wrote the code, I intended to catch runtime.Error panics and let them flow through. The reason is that runtime.Error panics are recoverable, and there is no reason to let a cluster go down when they occur.

FYI I even went through the go source code to validate the following:

- runtime.Error is only emitted for "soft" errors like out-of-bound accesses, assertion failures, etc.
- for "serious" internal errors, e.g. in the scheduler, bad goroutine state, allocator problems, etc., the runtime throws a string which does not implement error and thus will not be captured here.
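
For illustration, a small stdlib-only sketch of how recovered values can be told apart. The `classify` helper is hypothetical and only covers the recoverable cases: an out-of-bounds access and a nil map write both surface as `runtime.Error`, while a panic carrying a plain string does not implement `error` at all.

```go
package main

import (
	"fmt"
	"runtime"
)

// classify runs fn, recovers any panic, and reports how the recovered value
// would look to a recover-based error handler.
func classify(fn func()) (kind string) {
	defer func() {
		switch r := recover().(type) {
		case nil:
			kind = "no panic"
		case runtime.Error:
			kind = "runtime.Error"
		case error:
			kind = "error"
		default:
			kind = fmt.Sprintf("non-error %T", r)
		}
	}()
	fn()
	return kind
}

func main() {
	var s []int
	var m map[string]int
	idx := 1

	fmt.Println(classify(func() { _ = s[idx] }))      // out-of-bounds access
	fmt.Println(classify(func() { m["k"] = 1 }))      // write to a nil map
	fmt.Println(classify(func() { panic("boom") }))   // explicit panic with a string
}
```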

So, can you explain a little better why you thought this PR was a good idea?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @justinj, @knz, and @rytaft)

RaduBerinde · 2019-06-29T13:51:35Z

Today if you are working on a change that results in a nil dereference or out-of-bound access, you get a one line error with no stack trace. Good luck debugging that. IMO that is not acceptable, both for development workflow and customer support (what will we do when we get a report from a customer which just says "out of bounds" with no other context?)

When we agreed to catch assertion errors thrown by the optimizer, it was with the condition that we will still always get stack traces for them. The discussion was mostly focused on assertions generated by our code, I don't think we specifically discussed catching runtime errors (at least not to my knowledge). I am ok catching them but only if we don't lose the stack trace.

knz · 2019-06-29T18:39:33Z

Oh I see. If you do errors.WithStack(err) when returning the recovered panic, you'll get the panic stack trace captured with the error.
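
A rough stdlib-only sketch of what capturing the stack at recovery time looks like. The `stackError` type and `catch` function are hypothetical; they only approximate what a stack-annotating wrapper such as `errors.WithStack` provides.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// stackError is a hypothetical wrapper that pairs an error with the stack
// captured at recovery time.
type stackError struct {
	cause error
	stack []byte
}

func (e *stackError) Error() string { return e.cause.Error() }
func (e *stackError) Unwrap() error { return e.cause }

// catch runs fn and converts a panic carrying an error into a *stackError.
func catch(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			e, ok := r.(error)
			if !ok {
				panic(r)
			}
			// Deferred functions run before the panicking frames are
			// unwound, so the stack captured here still includes the
			// panic site.
			err = &stackError{cause: e, stack: debug.Stack()}
		}
	}()
	fn()
	return nil
}

func main() {
	var s []int
	idx := 3
	err := catch(func() { _ = s[idx] })
	if se, ok := err.(*stackError); ok {
		// The message alone is one line; the stack must be printed
		// explicitly to be visible.
		fmt.Printf("%s\n%s", se, se.stack)
	}
}
```

Note that the stack is only useful if whatever reports the error actually prints it; `Error()` on its own still returns a single line.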

RaduBerinde · 2019-07-01T02:19:25Z

It doesn't work. The stack trace isn't shown in important cases:

In cockroach demo:

[email protected]:45519/defaultdb> select 1 as lolomg;
pq: runtime error: index out of range
[email protected]:45519/defaultdb>

In an opt test:

--- FAIL: TestBuilder (0.00s)
    --- FAIL: TestBuilder/select (0.00s)
        builder_test.go:60: 
            testdata/select:25: SELECT 1 AS lolomg
            expected:
            
            found:
            error: runtime error: index out of range
FAIL

RaduBerinde · 2019-07-01T02:24:24Z

I put the patch which I ran above in https://github.com/RaduBerinde/cockroach/tree/opt-err-fix-2

RaduBerinde · 2019-07-01T02:43:03Z

Maybe I should try NewAssertionErrorWithWrappedErrf?

knz · 2019-07-01T09:26:47Z

oh yes, absolutely. I hadn't thought of that but indeed it's the best way to ensure we get telemetry, etc.

RaduBerinde · 2019-07-02T20:30:55Z

Just leaving a note with the status of this PR - converting to AssertionFailed didn't quite work because it still doesn't print the stack trace in tests (with %+v); @knz is going to fix that in the error library first.

@RaduBerinde

38710: errors: fix the formatting with %+v r=knz a=knz

(found by @RaduBerinde; needed to complete #38570)

The new library `github.com/cockroachdb/errors` was not implementing `%+v` formatting properly for assertion and unimplemented errors. The wrong implementation was hiding the details of the cause of these two error types from the formatting logic.

Fixing this bug comprehensively required completing the investigation of the Go 2 / `xerrors` error proposal. This revealed that the implementation of `fmt.Formatter` for wrapper errors (a `Format()` method) is required in all cases, at least until Go's stdlib learns about `errors.Formatter`. More details at golang/go#29934 and this commit message: cockroachdb/errors@78b6caa.

This patch bumps the dependency `github.com/cockroachdb/errors` to pick up the fixes to assertion failures and unimplemented errors.

The new definition of `errors.FormatError()` subsequently required re-implementing `Format()` for `pgerror.withCandidateCode`, which is also done here.

Finally, this patch also picks up `errors.As()` and the new streamlined `fmt.Formatter` / `errors.Formatter` interaction, so this patch also simplifies a few custom error types in CockroachDB accordingly.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
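
To make the `fmt.Formatter` point concrete, here is a minimal stdlib-only sketch. The `withDetail` type, `render` helper, and `errOutOfRange` value are hypothetical and are not the library's implementation. Without a `Format()` method, `fmt` prints only `Error()` for every verb, so a wrapper has to implement `Format()` itself if `%+v` should show more.

```go
package main

import (
	"errors"
	"fmt"
)

// errOutOfRange is a hypothetical cause used for the demonstration.
var errOutOfRange = errors.New("index out of range")

// withDetail is a hypothetical wrapper error carrying extra detail (standing
// in for a stack trace) that should only appear under %+v.
type withDetail struct {
	cause  error
	detail string
}

func (e *withDetail) Error() string { return e.cause.Error() }
func (e *withDetail) Unwrap() error { return e.cause }

// Format implements fmt.Formatter so that %+v prints the detail; other
// verbs fall back to the plain message.
func (e *withDetail) Format(s fmt.State, verb rune) {
	if verb == 'v' && s.Flag('+') {
		fmt.Fprintf(s, "%s\n%s", e.Error(), e.detail)
		return
	}
	fmt.Fprint(s, e.Error())
}

// render formats err with the given verb, as a test harness might.
func render(format string, err error) string {
	return fmt.Sprintf(format, err)
}

func main() {
	err := &withDetail{cause: errOutOfRange, detail: "at main.explode"}
	fmt.Println(render("%v", err))  // message only
	fmt.Println(render("%+v", err)) // message followed by the detail
}
```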

RaduBerinde · 2019-07-08T20:25:31Z

Updated, using NewAssertionErrorWithWrappedErrf now.

knz

Reviewed 19 of 19 files at r1.

knz

pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

			// Convert runtime errors to internal errors, which display the stack and
			// get reported to Sentry.
			err = errors.NewAssertionErrorWithWrappedErrf(err, "")

That's what's creating the surprising result.
Until I fix this you can make the surprising errors with safe detail disappear (and also introduce a clarification about where the runtime error comes from) as follows:

err = errors.HandledWithMessage(err, "Go runtime error")
err = errors.WithAssertionFailure(err)
err = errors.WithStack(err)

knz

pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

see https://github.com/cockroachdb/errors/pull/3

knz

pkg/util/errorutil/catch.go, line 29 at r1 (raw file):

Previously, knz (kena) wrote…

see cockroachdb/errors#3

Then you can use err = errors.HandleAsAssertionFailure(err) instead of the 3 lines I listed above.

RaduBerinde · 2019-07-09T14:17:13Z

Bumped the dep and switched to HandleAsAssertionFailure.

knz

Reviewed 3 of 3 files at r2.

RaduBerinde · 2019-07-09T15:51:55Z

TFTR!

bors r+

craig · 2019-07-09T16:47:27Z

Build succeeded

GitHub CI (Cockroach)

RaduBerinde requested review from justinj, knz, rytaft and andy-kimball June 29, 2019 03:26

RaduBerinde requested a review from a team as a code owner June 29, 2019 03:26

RaduBerinde force-pushed the opt-err-fix branch from c88ec28 to 3ef833f Compare June 29, 2019 03:28

knz reviewed Jun 29, 2019

knz mentioned this pull request Jul 5, 2019

errors: fix the formatting with %+v #38710

Merged

RaduBerinde force-pushed the opt-err-fix branch from 3ef833f to 1caf74a Compare July 8, 2019 20:24

RaduBerinde force-pushed the opt-err-fix branch 2 times, most recently from 54b69fb to 525b9ff Compare July 9, 2019 01:40

knz approved these changes Jul 9, 2019

knz suggested changes Jul 9, 2019

knz approved these changes Jul 9, 2019

knz reviewed Jul 9, 2019

RaduBerinde force-pushed the opt-err-fix branch from 525b9ff to 5ab44a9 Compare July 9, 2019 14:14

RaduBerinde requested a review from a team July 9, 2019 14:14

knz reviewed Jul 9, 2019

craig bot merged commit 5ab44a9 into cockroachdb:master Jul 9, 2019

RaduBerinde deleted the opt-err-fix branch July 10, 2019 18:19

knz mentioned this pull request Nov 10, 2019

User-facing changes in 19.2 that were not picked up in release notes cockroachdb/docs#5819

Closed
