rfcs: proposal for error handling #36987

knz · 2019-04-21T20:42:59Z

RFC text: https://github.com/knz/cockroach/blob/20190421-errors-rfc/docs/RFCS/20190318_error_handling.md

This RFC explains how our requirements for error handling have grown over time and how the various code patterns currently in use in CockroachDB are inadequate.

It then proposes a new library of error types. This library is compatible with the error interface, including the upcoming Go 2 semantics.

cockroach-teamcity · 2019-04-21T20:43:05Z

This change is

tbg

This generally looks good to me. I like how you're not using a "god error" and the code over in https://github.com/cockroachdb/cockroach/pull/36023/files gives me a good high-level overview over the wire encoding.

What's missing to my understanding is how the version migration works. 19.1 won't understand anything about these new errors, which is probably fine as long as it didn't try to deconstruct them in the first place. But what about the roachpb.Error boundary, for example? Are these just left alone for now or is a roachpb.NewError replaced by some errors.X call that at the same time results in something that the receiver (19.1 across the wire) can treat as a *roachpb.Error and get the detail from (seems unlikely)?

You've alluded to having set things up in a way that doesn't require a "forklift". It'd be interesting to know how the implementation would work. Could you just land the errors package without any changes to auxiliary code, and then pick a contained area of the code in which to integrate them? What would that area be?

tbg · 2019-04-22T14:05:30Z

docs/RFCS/20190318_error_handling.md

+
+These situations are largely out of scope for this RFC, except for the
+following: **we should aim to audit the code and ensure that no
+barrier errors is never dropped** and instead let to flow where it can


s/never/ever/

tbg · 2019-04-22T14:05:35Z

docs/RFCS/20190318_error_handling.md

+
+## Too much diversity
+
+There are currently 5 different error handling "protocols" inside


Would help to list them briefly.

tbg · 2019-04-22T14:05:37Z

docs/RFCS/20190318_error_handling.md

+
+1. `s1 := err.Error()`
+2. `s2 := err.Cause().Error()`
+3. `prefix_message := s1[:len(s2)-len(s1)]`


Oof, this may also break because who's to say that the error was created with errors.Wrap? Anyone can implement causer.

tbg · 2019-04-22T14:08:44Z

docs/RFCS/20190318_error_handling.md

+
+To add a code useful to PostgreSQL clients, one can use
+e.g. `errors.WithDefaultCode(err, pgerror.CodeSyntaxError)`.
+The new code is only used if the original error did not provide a code already.


Are these the right semantics? I suppose morally you want to not wrap an error with an error code twice. How about a boolean return or chaining with a marker error (so that you keep both codes, you can then decide if you want the inner our outer)?

tbg · 2019-04-22T14:11:55Z

docs/RFCS/20190318_error_handling.md

+
+Throughout the SQL package (and presumably over time throughout
+CockroachDB) errors can be annotated with "telemetry keys" to be
+increment when the error flows out of a server towards a client.


incremented

tbg · 2019-04-22T14:12:33Z

docs/RFCS/20190318_error_handling.md

+The following APIs are provided (upward-compatible from existing `pgerror` client code):
+
+```go
+func UnimplementedWithIssue(issue int, format string, args ...interface{})


This is missing either a receiver or return type, possibly both.

tbg · 2019-04-22T14:15:36Z

docs/RFCS/20190318_error_handling.md

+- `errors.WithDetail(err, detail)` returns `&withDetail{cause: err, detail: detail}`
+- `errors.Wrap(err, msg)` returns `&withMessage{cause: &withStack{cause: err, stack: callers()}, message: msg}`
+
+We use multiple elementary type instead of a single "god type" with


tbg · 2019-04-22T14:24:17Z

docs/RFCS/20190318_error_handling.md

+
+```go
+// EncodeError converts a Go error to an encodable error.
+func EncodeError(err error) AnyErrorContainer


Elaborate on AnyErrorContainer.

knz

What's missing to my understanding is how the version migration works.

Oh right, I forgot to write about that. Will add something tonight or tomorrow.

You've alluded to having set things up in a way that doesn't require a "forklift". It'd be interesting to know how the implementation would work. Could you just land the errors package without any changes to auxiliary code, and then pick a contained area of the code in which to integrate them? What would that area be?

There are two mechanisms:

replacing the import at the top from errors or github.com/pkg/errors to github.com/cockroachdb/cockroach/pkg/errors. This will "just work" already. Additional details (via pg code etc) can be added incrementally after that.
as to actually using the new mechanism (Markers, etc). I plan do do this change throughout sql and sqlbase to start (where they are most needed) then review the other SQL packages which I know best. As I learn more about core I plan to gradually "do things" there too.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @danhhz, and @tbg)

docs/RFCS/20190318_error_handling.md, line 83 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Would help to list them briefly.

Done.

docs/RFCS/20190318_error_handling.md, line 169 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Oof, this may also break because who's to say that the error was created with errors.Wrap? Anyone can implement causer.

Clarified in the example - this is only valid if s2 is suffix of s1.

docs/RFCS/20190318_error_handling.md, line 421 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/never/ever/

Done.

docs/RFCS/20190318_error_handling.md, line 541 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Are these the right semantics? I suppose morally you want to not wrap an error with an error code twice. How about a boolean return or chaining with a marker error (so that you keep both codes, you can then decide if you want the inner our outer)?

Clarified that this is an educated choice, and clarified that we can do different later.

docs/RFCS/20190318_error_handling.md, line 633 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

incremented

Done.

docs/RFCS/20190318_error_handling.md, line 640 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is missing either a receiver or return type, possibly both.

Done.

docs/RFCS/20190318_error_handling.md, line 1093 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

types

Done.

docs/RFCS/20190318_error_handling.md, line 1188 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Elaborate on AnyErrorContainer.

Done. - but what is needed to know?

jordanlewis · 2019-04-22T16:27:38Z

docs/RFCS/20190318_error_handling.md

+
+### Adding `context`
+
+When wrapping an error with `errors.WithCtx(err, ctx)`, the logging


This scares me - context lifetimes are already complicated enough - we really don't want them to escape into errors if we can avoid it.

jordanlewis · 2019-04-22T16:27:57Z

docs/RFCS/20190318_error_handling.md

+"hello")`, although the form `Wrapx()` is preferred (see next section).
+
+To add a code useful to PostgreSQL clients, one can use
+e.g. `errors.WithDefaultCode(err, pgerror.CodeSyntaxError)`.


I don't understand this - what makes it a "default" code?

jordanlewis · 2019-04-22T16:30:34Z

docs/RFCS/20190318_error_handling.md

+  the barrier error will mask the original code and replace it
+  with `XX000/CodeInternalError`.
+
+- if the original pg error code was present with the `XX` prefix, then


I thought we were not supposed to inspect pg error codes?

jordanlewis · 2019-04-22T16:31:09Z

docs/RFCS/20190318_error_handling.md

+### Promotion of `XX` codes to barrier errors
+
+When a `XX` code is provided via `Wrapx`, `WithDefaultCode`, etc, the
+result becomes a barrier and the original error, if any, is masked as an


This seems "magic" and therefore best avoided. Don't you think it would be better to force explicit construction of barrier errors?

jordanlewis · 2019-04-22T16:33:47Z

docs/RFCS/20190318_error_handling.md

+e.g. `errors.Wrapf(err1, "while handling %v", err2)`.
+
+The proposed library extends this behavior and makes it possible to
+preserve multiple causes using `WithOtherCause()`, for example:


This seems like it has to be potential to be overly complex. Is it really that important to preserve the other cause? What's wrong with the current behavior in practice?

jordanlewis · 2019-04-22T16:36:08Z

docs/RFCS/20190318_error_handling.md

+
+## Problematic error use cases
+
+### Suspicious comparisons of the error object


Comparing to io.EOF isn't suspicious most of the time - that's how Go recommends that you check for error conditions from its library calls, which I'm sure is done in these code samples, right?

bdarnell

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @danhhz, @jordanlewis, @knz, and @tbg)

docs/RFCS/20190318_error_handling.md, line 340 at r1 (raw file):

```go
// Try an INSERT.
if err := txn.Exec(txn1); err != nil {

s/txn([12])/stmt\1/g

docs/RFCS/20190318_error_handling.md, line 359 at r1 (raw file):

if origErr := txn.Exec(txn1); origErr != nil {
  if sqlbase.IsDuplicateInsertError(origErr) {
     if newErr := txn.Exec(txn2); err != nil {

s/err/newErr/

docs/RFCS/20190318_error_handling.md, line 382 at r1 (raw file):

information loss in the current source code.

**In general, patterns of code like if-error-do-something-else need

I'd argue that in this particular case, the original code is not so bad, because IsDuplicateInsertError should have consumed all that is interesting about the original error. There might be cases where this is important, but a tree of error causes sounds unnecessarily complex (and without successful precedent to my knowledge).

Also note that this specific example doesn't work at the sql client level - once you've hit an error, you have to roll back the transaction. You have to use INSERT ON CONFLICT DO UPDATE.

docs/RFCS/20190318_error_handling.md, line 404 at r1 (raw file):

are all defective in the same way.

**In general, if-error-do-something-else-that-is-not-an-error is code smell

I agree with this, and would extend it to the previous section too. So why was the previous section used to justify a complex new requirement of a tree of error causes while this aspect is declared out of scope?

docs/RFCS/20190318_error_handling.md, line 427 at r1 (raw file):

## Motivation for a new error type: summary

The requirements on error objects have grown over time.

roachpb.Error was our original plan for structured error handling, and yet as requirements grew we added new error types (distsqlpb.Error and pgerror.Error) rather than extending roachpb.Error. I think it's worth discussing why this happened and what this means for the feasibility of one unifying error type.

Are there conflicting requirements at different levels? (I'm thinking about how we originally had a global "retryable" flag, but it turned out that different levels could retry different things). Are there backwards-compatibility requirements that make it difficult to change or replace roachpb.Error? Was everything forced through a bottleneck somewhere that turns things into strings or go 1 error? Is it just tech debt (or Conway's law) that we created separate error types instead of unifying?

docs/RFCS/20190318_error_handling.md, line 442 at r1 (raw file):

  We want to report important errors to telemetry (Sentry) for further inspection. However
  the report must be stripped of PII. We want error objects that preserve the "safe" part
  of details available when the error was produced or wrapped.

Could we meet this requirement by reporting errors to sentry at the point where the errors are generated instead of requiring that they be preserved through the whole stack? Or is the idea that the reportability of an error is depends on non-local conditions?

docs/RFCS/20190318_error_handling.md, line 444 at r1 (raw file):

  of details available when the error was produced or wrapped.

- **pg error codes.**

If this is a requirement, then the answer will have to be some bespoke cockroach-specific type. But is it really a requirement for the basic error type? As long as there is structure to the errors, we can map structured errors to pg errors at the top level (like we do now).

docs/RFCS/20190318_error_handling.md, line 498 at r1 (raw file):

# Guide-level explanation

The package is `github.com/cockroachdb/cockroach/pkg/errors`.

Does errors.Error implement error? If so, we can run into the nil interface problem and it's harder to access the structured features (and easy for them to get stripped away); if not it's intrusive to change signatures everywhere. We chose not to implement error for roachpb.Error, and I still think that's probably the right decision, but the dichotomy between the two error types has proved very annoying.

Ergonomics nit: Unless the plan is to prohibit all use of github.com/pkg/errors as soon as this package is introduced, please give it a different name to avoid confusion.

docs/RFCS/20190318_error_handling.md, line 531 at r1 (raw file):

The library is compatible with existing protobuf error objects, so
instantiating, for example, with `err :=

This is trivially valid because err has inferred type *roachpb.RangeRetryError. Did you mean to say that errors.Error is an interface type and we the existing protobuf errors can implement that interface?

docs/RFCS/20190318_error_handling.md, line 539 at r1 (raw file):

"hello")`, although the form `Wrapx()` is preferred (see next section).

To add a code useful to PostgreSQL clients, one can use

This level of specificity smells bad to me. It seems better to either have an extensible system (like context.Context?) for attaching metadata to an error or to defer the mapping of errors to pgerror codes until higher in the stack. I don't think it's a good idea (or if it's even possible) for code in core to care what error codes we'll ultimately return in pgwire.

Concretely, the MSO team is currently building a distributed system that does not speak pgwire, but has many of the other requirements regarding structured, wire-encodable errors. This system might want to use HTTP error codes in place of pg errors. It would be a shame if we either had to fork the error package for each project or build in special support for each kind of error annotation.

docs/RFCS/20190318_error_handling.md, line 687 at r1 (raw file):

intermediate message prefixes added via `github.com/pkg/errors.Wrap()`.

### Barrier errors

Nit: My initial read of "barrier errors" was to think of the threading concept of barriers and to expect something related to ErrGroup. I don't have a good alternative name, though. OpaqueError? ShieldedError? Or AssertionError since that's what you used in the factory functions.

docs/RFCS/20190318_error_handling.md, line 733 at r1 (raw file):

"internal cause" invisible via `causer`/`Wrapper` interfaces.

This makes it possible to e.g. wrap with `CodeCCLRequired` (`XXC01`)

Why would CodeCCLRequired ever be a wrapper instead of a leaf error?

docs/RFCS/20190318_error_handling.md, line 1169 at r1 (raw file):

// Compatibility with github.com/pkg/errors.
func Errorf(format string, args ...interface{}) Error
func WithStack(err error) error

Is the mix of Error and error here deliberate?

docs/RFCS/20190318_error_handling.md, line 1477 at r1 (raw file):

Suspicious comparisons of the error message

Looking through this list, most of these are inspecting errors that are generated outside our code from packages that don't provide any introspection interfaces. So those will still be around no matter what we do with our own errors. There are some things we should be doing better, but I think the volume of code included here paints an unnecessarily dark picture.

docs/RFCS/20190318_error_handling.md, line 737 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

This seems "magic" and therefore best avoided. Don't you think it would be better to force explicit construction of barrier errors?

Yeah, it seems to me that barrier errors shouldn't be so special. They're just errors without a cause, and a separately inspectable (optional!) piece of metadata that happens to be another error.

knz

Andrei RFA(small)L

Here are the latest changes:

I moved/removed text away from the "guide-level guide" and integrated the details in the "reference-level guide" instead, which makes the text flow better and the doc easier to navigate
I improved/extended the motivation for the 3 base modules (backbone, barriers, markers) in response to your comments (and also offline comments by Tobias)

What I haven't done yet: revisit/rewrite/simplify the two "subsumed errors" sections in the motivation section, based on your comments. That's why I haven't responded to this yet, but I will. I will do this later today.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @danhhz, @jordanlewis, @knz, @nvanbenschoten, @tbg, and @vivekmenezes)

docs/RFCS/20190318_error_handling.md, line 1266 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

shouldn't this be expressed as an interface that error type can implement, instead of this need to "register" error types?

Created a new section "Discussion: callback registration vs. interfaces" below to motivate further.

docs/RFCS/20190318_error_handling.md, line 1458 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

we discussed using just the types for markers here and hoping for the best

I thought about it further and discussed with Tobias and came to the conclusion that it's going to be messages after all. Added a discussion section below to weigh both sides.

docs/RFCS/20190318_error_handling.md, line 1792 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

As we were discussing, my name is used here but I'm not particularly sure I want to advocate for actually doing this.
Let's see what Tobi says too, but maybe move this away to annex.

Tobias suggests leaving it there and let more people chime in about whether they like it.

docs/RFCS/20190318_error_handling.md, line 1817 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'd call out here the fact that the EnsureNotInDomain() function does usually return barrier errors.

Done.

docs/RFCS/20190318_error_handling.md, line 1817 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This requirement was about ContextDeadlineExceeded errors, right? I'd note that.

Done.

docs/RFCS/20190318_error_handling.md, line 2022 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

so WithAssertionFailure errors are supposed to wrap barrierError errors, right? Can't we flatten them - can't we make errutil.AssertionFailed() return an AssertionError that is not wrapping anything (or more exactly is not implementing causer)? I understand that generally you like to build small composable types, but here I don't think the composition gives you anything. And allowing assertions to wrap arbitrary errors seems confusing to me. It seems to me that it'd be better to be more prescriptive here.

Added a discussion section underneath to clarify.

docs/RFCS/20190318_error_handling.md, line 2276 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

wrappde

Done.

docs/RFCS/20190318_error_handling.md, line 2554 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

one thing that's not clear to me is how are we going to tell the "old client" cases, here and in the "Core RPCs" case below? Are we going to introduce a client version field in the requests?

I would think to produce the two error variants side-by-side (until the next release where we can delete the old path), like is done here:

cockroach/pkg/sql/distsqlpb/data.go

Lines 141 to 157 in 421483f

    
           func NewError(err error) *Error { 
        
           	resErr := &Error{} 
        
           	// Encode the full error to the best of our ability. 
        
           	// This field is ignored by pre-19.1 nodes. 
        
           	fullError := sqlerror.EncodeError(err) 
        
           	resErr.FullError = &fullError 
        
           	// Now populate compatibility fields for pre-19.1 nodes. 
        
           	if retryErr, ok := errors.Cause(err).(*roachpb.UnhandledRetryableError); ok { 
        
           		resErr.Detail = &Error_RetryableTxnError{RetryableTxnError: retryErr} 
        
           	} else { 
        
           		pgErr, _, _ := pgerror.Flatten(err) 
        
           		resErr.Detail = &Error_PGError{PGError: pgErr} 
        
           	} 
        
           	return resErr 
        
           }

knz

RFA(generalized)L.
I think all comments so far have been addressed.

Fixed/removed the dubious text on subsumed errors towards he beginning.
Also added an "implementation strategy" section towards the end.
Added cross-references between guide and reference-level sections.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @danhhz, @jordanlewis, @knz, @nvanbenschoten, @tbg, and @vivekmenezes)

docs/RFCS/20190318_error_handling.md, line 382 at r1 (raw file):

Previously, knz (kena) wrote…

Also note that this specific example doesn't work at the sql client level - once you've hit an error, you have to roll back the transaction. You have to use INSERT ON CONFLICT DO UPDATE.

Alas, there is code that does exactly this. Consider from schema_changer.go:
  // Run through mutation state machine and backfill.
  err = sc.runStateMachineAndBackfill(ctx, &lease, evalCtx)

  // Purge the mutations if the application of the mutations failed due to
  // a permanent error. All other errors are transient errors that are
  // resolved by retrying the backfill.
  if isPermanentSchemaChangeError(err) {
    if err := sc.rollbackSchemaChange(ctx, err, &lease, evalCtx); err != nil {
      return err
    }
  }

  return err
inside rollbackSchemaChange, the err given as argument evaporates. If anything "interesting" was present in err, nobody would ever know.

Do you think I should use that example directly instead of using a synthetic example?

a tree of error causes sounds unnecessarily complex (and without successful precedent to my knowledge).

The emphasis is on capturing the precise causes and context for the purpose of troubleshooting and post-mortem analysis, not to have downstream code do anything useful with the tree of causes. In fact if you follow along the API definition, all the logic emphasizes the "direct" chain of causes and the "other" causes are only traversed to produce synthetic reports (the pg error object or the sentry packet).

(I have added a sentence to clarify the goal.)

Replaced the example.

docs/RFCS/20190318_error_handling.md, line 386 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: this example is not very good because neither the go's sql package nor our internal client package let one continue with a transaction after an error like this. At either level, txn.Exec(stmt) won't work.

You're right, I have replaced the example.

docs/RFCS/20190318_error_handling.md, line 439 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Did you mean to write "assertion error" here instead of "barrier error"? Cause otherwise I for one don't have a problem with letting barrier errors be "dropped".

My suggestion is to drop this "Errors subsumed by non-errors" section. I don't quite agree with it, and even if I did, it does not seem to contribute anything to the RFC besides the "we should aim" sentence, which which I also disagree (unless you change barrier->assertion).

The general problem is that "if-error-do-something-else-that-is-not-an-error" loses information.

It does lose information, but so what? Everything loses information - every time you update a variable you lose information about what the old state was. We don't particularly worry about this in other situations - so why would we worry about it when we handle errors?
I don't (or at least no longer) have a problem with your proposal to added fairly opaque "secondary causes" to errors as you suggest in other sections cause in that case there's an error flowing around and so it's clear what to attach this extra information to, but when there's no error, what even is your suggestion? You say that the protocol should be different and the original error shouldn't have been an error is someone can handle it, but in that case we would "lose information" just the same.

In general, if-error-do-something-else-that-is-not-an-error is code smell
that should be avoided. It suggests that the original error object
was actually something else than error, for example it should have
been part of the function protocol.

In my opinion, this is a very strong and prescriptive statement and in the general case I don't particularly agree. All errors that don't lead to a crash are handled at some level and thus are transformed into no error...
The philosophy of Go is pretty clear - errors are values like any other, not particularly exceptional. For example, I think io.EOF is a fine thing to return from IO routines.
I agree that some things should not be represented as error, but it's case by case. For example, I agree that CPuts should not have returned errors when the condition fails. The reason why I agree is because for the KV layer in particular, errors mean that you can't continue with the transaction, and it would have been beneficial to be able to continue after a CPut error (and there's no particular reason why you couldn't).
On the other hand, I disagree that retriable errors should not have been errors. In those cases, you generally indeed cannot just continue with the transaction, and so errors seem appropriate to me. Also consider how much code would be polluted by having some other return value for signaling these retriable conditions.

I'm curious to hear other opinions, though. @bdarnell for one seemed to agree with a previous phrasing of this section, and I wonder what he thinks.

I removed the section altogether and replaced it with a simpler section with a much narrower scope.

andreimatei

🎆 looking fwd to all of this

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @danhhz, @jordanlewis, @knz, @nvanbenschoten, @tbg, and @vivekmenezes)

tbg

I didn't read all of the proposal in detail but I did read the rationale and guide level explanation carefully. Looking forward to all of this happening.

Reviewed 1 of 2 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @danhhz, @jordanlewis, @knz, @nvanbenschoten, and @vivekmenezes)

docs/RFCS/20190318_error_handling.md, line 609 at r6 (raw file):

In the proposed library, the postgres details can be added with e.g.
`errors.WithDetail(err, "some detail")`.

Is this (detail and hints) specific to PG errors or could we also adopt this elsewhere? I'm asking a) because that might be useful and b) if it's specific to PG WithDetail sounds too general. Perhaps WithPG{Detail,Hint}. Or we just accept this special casing.

docs/RFCS/20190318_error_handling.md, line 702 at r6 (raw file):

  // Was:
  //
  //    if err == io.EOF { ...

Consider giving an anti-example. For example, if errors.Is(err, &SomeError{}) { ... } would never match, right? That is, Is is only for sentinel errors.

docs/RFCS/20190318_error_handling.md, line 780 at r6 (raw file):

printing

docs/RFCS/20190318_error_handling.md, line 850 at r6 (raw file):

     // Try with an UPDATE instead.
     if newErr := txn.Exec(stmt2); err != nil {
        err := errors.Wrap(newErr, "while updating")

Mighi be worth mentioning which error gets to go first. In this example, it's the innermost error. Is that always the case? Also, what if there is more than one secondary error? Can we errors.WithSecondary an error more than once?

docs/RFCS/20190318_error_handling.md, line 1004 at r6 (raw file):

> special support for each kind of error annotation.

We can extend the system to adopt HTTP error codes as follows:

Is there a reason the PGWire-specific functionality wasn't implemented as an extension (or was it)?

docs/RFCS/20190318_error_handling.md, line 1028 at r6 (raw file):

```go
func GetHTTPCode(err error) int {
    for ; err != nil; err = errors.UnwrapOnce(err) {

Interesting, I would've expected there to be some sugar for this operation. Why not errors.If? Hmm, looks like that's just more and more awkward code. Ok, nevermind, this is good.

docs/RFCS/20190318_error_handling.md, line 1045 at r6 (raw file):

For this we can work as follows:

- add a new wrapper type `withLogTags{cause error, tags ...}`. Make

Generally this types should have error in their name somewhere, don't you think? I.e. withLogTagsError or something like that. Or did you deliberately not do that because these are just wrapper errors that don't contain a message?

docs/RFCS/20190318_error_handling.md, line 1056 at r6 (raw file):

- ensure that `withLogTags{}` implements the `Format` method so that
  the logging tags are included when the entire error chain is rendered
  via `%+v`.

The tags would also be printed from .Error(), right? Would be nice to have them in the logs. I think this is what you were planning, but maybe make it explicit somewhere.

danhhz

There's already a lot of cooks in this kitchen, so I didn't do a line-by-line review. Approach this is extraordinary work

Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @danhhz, @jordanlewis, @knz, @nvanbenschoten, and @vivekmenezes)

approval by other team members

37121: errors: introduce a general-purpose modular error library with good properties r=knz a=knz This PR introduces an error library implemented after the principles laid out in #36987. It subsumes the following two PRs: - #38127 *: demonstrate uses of the errors library - review approval: #38127 (comment) - #37813 sql, *: simplify remove dependencies on pgerror, util/log - review approval: #37813 (comment) Release note: None Co-authored-by: Raphael 'kena' Poss <[email protected]>

Release note: None

knz · 2019-06-18T17:17:45Z

completed in #37121

bors r+

36987: rfcs: proposal for error handling r=knz a=knz RFC text: https://github.com/knz/cockroach/blob/20190421-errors-rfc/docs/RFCS/20190318_error_handling.md This RFC explains how our requirements for error handling have grown over time and how the various code patterns currently in use in CockroachDB are inadequate. It then proposes a new library of error types. This library is compatible with the error interface, including the upcoming Go 2 semantics. Co-authored-by: Raphael 'kena' Poss <[email protected]>

craig · 2019-06-18T17:55:39Z

Build succeeded

GitHub CI (Cockroach)

knz added the do-not-merge bors won't merge a PR with this label. label Apr 21, 2019

knz requested a review from a team as a code owner April 21, 2019 20:42

knz force-pushed the 20190421-errors-rfc branch 14 times, most recently from 2c22ee9 to 191ec10 Compare April 22, 2019 12:09

knz changed the title ~~(WIP) rfcs: proposal for error handling~~ rfcs: proposal for error handling Apr 22, 2019

knz requested review from danhhz, andreimatei and tbg April 22, 2019 12:10

knz force-pushed the 20190421-errors-rfc branch 2 times, most recently from af05e2b to 2bf40ae Compare April 22, 2019 12:54

knz mentioned this pull request Apr 22, 2019

client: make roachpb.NewError work with wrapped ErrorDetailInterface errors #36918

Closed

tbg reviewed Apr 22, 2019

View reviewed changes

knz commented Apr 22, 2019

View reviewed changes

knz force-pushed the 20190421-errors-rfc branch 2 times, most recently from e1a2974 to c97a582 Compare April 22, 2019 16:22

jordanlewis reviewed Apr 22, 2019

View reviewed changes

bdarnell reviewed Apr 22, 2019

View reviewed changes

knz force-pushed the 20190421-errors-rfc branch from 975e504 to 9b8534a Compare May 10, 2019 16:02

knz commented May 10, 2019

View reviewed changes

knz force-pushed the 20190421-errors-rfc branch 3 times, most recently from 1de7553 to 9a21121 Compare May 11, 2019 11:58

knz commented May 11, 2019

View reviewed changes

knz removed the do-not-merge bors won't merge a PR with this label. label May 11, 2019

knz requested a review from bdarnell May 11, 2019 12:08

andreimatei approved these changes May 16, 2019

View reviewed changes

knz mentioned this pull request May 20, 2019

sql: include anonymized query with internal error reports #37514

Closed

tbg approved these changes May 20, 2019

View reviewed changes

danhhz approved these changes May 20, 2019

View reviewed changes

knz force-pushed the 20190421-errors-rfc branch from 9a21121 to 7fa6178 Compare May 20, 2019 15:56

knz mentioned this pull request May 23, 2019

*: demonstrate uses of the errors library #37765

Closed

knz force-pushed the 20190421-errors-rfc branch 4 times, most recently from 514c377 to ddd3bf4 Compare June 10, 2019 11:34

knz mentioned this pull request Jun 10, 2019

*: demonstrate uses of the errors library #38127

Merged

knz force-pushed the 20190421-errors-rfc branch 2 times, most recently from 106b1d7 to 02cab13 Compare June 10, 2019 13:31

rfcs: proposal for error handling

2fe327c

Release note: None

knz force-pushed the 20190421-errors-rfc branch from 02cab13 to 2fe327c Compare June 18, 2019 17:17

craig bot merged commit 2fe327c into cockroachdb:master Jun 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rfcs: proposal for error handling #36987

rfcs: proposal for error handling #36987

knz commented Apr 21, 2019 •

edited

Loading

cockroach-teamcity commented Apr 21, 2019

tbg left a comment

tbg Apr 22, 2019

tbg Apr 22, 2019

tbg Apr 22, 2019

tbg Apr 22, 2019

tbg Apr 22, 2019

tbg Apr 22, 2019

tbg Apr 22, 2019

tbg Apr 22, 2019

knz left a comment

jordanlewis Apr 22, 2019

jordanlewis Apr 22, 2019

jordanlewis Apr 22, 2019

jordanlewis Apr 22, 2019

jordanlewis Apr 22, 2019

jordanlewis Apr 22, 2019

bdarnell left a comment

knz left a comment

knz left a comment

andreimatei left a comment

tbg left a comment

danhhz left a comment

knz commented Jun 18, 2019

craig bot commented Jun 18, 2019


		## Too much diversity

		There are currently 5 different error handling "protocols" inside


		### Adding `context`

		When wrapping an error with `errors.WithCtx(err, ctx)`, the logging


		## Problematic error use cases

		### Suspicious comparisons of the error object

	func NewError(err error) *Error {
	resErr := &Error{}

	// Encode the full error to the best of our ability.
	// This field is ignored by pre-19.1 nodes.
	fullError := sqlerror.EncodeError(err)
	resErr.FullError = &fullError

	// Now populate compatibility fields for pre-19.1 nodes.
	if retryErr, ok := errors.Cause(err).(*roachpb.UnhandledRetryableError); ok {
	resErr.Detail = &Error_RetryableTxnError{RetryableTxnError: retryErr}
	} else {
	pgErr, _, _ := pgerror.Flatten(err)
	resErr.Detail = &Error_PGError{PGError: pgErr}
	}
	return resErr
	}

rfcs: proposal for error handling #36987

rfcs: proposal for error handling #36987

Conversation

knz commented Apr 21, 2019 • edited Loading

cockroach-teamcity commented Apr 21, 2019

tbg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bdarnell left a comment

Choose a reason for hiding this comment

Suspicious comparisons of the error message

knz left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

tbg left a comment

Choose a reason for hiding this comment

danhhz left a comment

Choose a reason for hiding this comment

knz commented Jun 18, 2019

craig bot commented Jun 18, 2019

Build succeeded

knz commented Apr 21, 2019 •

edited

Loading