Improved Error Handling #625

rylev · 2022-01-24T16:45:03Z

This is a proposal for a new form of error handling in the SDKs I have been working on with @yoshuawuyts. This is an attempt to reduce the large amount of errors into as compact a form as possible while still offering good ergonomics and quality error messages.

In the grand scheme of things, this proposal is not a radical departure from the current approach taken, but is an attempt to rethink error handling from first principles.

The current approach

The current approach to error handling is to create a large, all-encompassing error type modeled as a Rust enum. This looks like this:

/// A general Azure error type.
#[non_exhaustive]
#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("parse error")]
    Parse(#[from] ParseError),
    #[error("error getting token")]
    GetToken(#[source] Box<dyn std::error::Error + Send + Sync>),
    #[error("http error")]
    Http(#[from] HttpError),
    /// .... 
}

This approach does have some nice properties:

The thiserror crate makes it pretty easy to declare new error variants together with how they are displayed to the user with minimal boiler plate and that conform to best practices for error types including provide an implementation for the Error::source method which shows what other error caused this error.
Error messages are relatively consistent (i.e., if two parsing errors can happen in two different SDKs, they'll largely have the same error message)
More error types can be added without breaking existing user code (through the use of non_exhaustive.
Conversions from 3rd party error types to this error type are generally fairly easy.

However, this approach does have downsides:

The specific variant of Error in question is tied directly to the underlying data or error cause. For example, in the code above, an Error::Parse must always have a corresponding ParseError. This encourages an ever expanding tree of errors.
Context dependent error messages are hard to add. Since the data associated with an error is directly tied to the error variant, it's hard to add additional context to errors that happen for unique reasons. For example,. say a parsing error when the key "foo" must have a value with the prefix "bar" - this happens only in one place in all the SDKs. If you want to display to the user this important context, you need to add a new variant to the ParseError type that explains this. This adds even more pressure for an ever expanding tree of error types.

The above list is not exhaustive but in general should give you the impression that the way we're modelling errors encourages error types with lots of variants and sub errors. This is exactly what we see in practice, where the error types in azure_core are bloated and often hard to keep track of.

What do we need?

Before getting into solutions, I'd like to briefly discuss the general needs we have from our errors. Errors are generally used for two things:

Control flow (i.e., the user inspects the errors and writes code specifically to handle a certain error case)
Error messages (i.e., the user just needs to display an error to the end user or developer consuming the library)

The more important of these two for our purposes is likely error messages. While no doubt, users of the library will want to write code that inspects some errors and handles them in special ways, the large majority of errors will be propagated up the stack either through use of ? or by panicking.

Therefore, we need errors that can handle control flow but more importantly, encourage good error messages.

The status quo provides ok error messages and excels at handling control flow, the exact opposite of our needs.

A proposed solution

This PR introduces a proposal for a new Error type that eventually will completely replace the current Error type (but that still works with it so that the transition does not need to happen all at once).

This new error is based closely on std::io::Error. Essentially the new error type wraps other errors and/or an error message along with an ErrorKind that roughly corresponds to the category of error being modeled.

First, let's see how this new error handles control flow (arguably the weaker of its capabilities):

Control Flow

 match error.kind() {
     ErrorKind::UnexpectedOperation { status } if status == 503 => {
           // TODO: handle 503 errors specially
     }
     ErrorKind::Io => {
          if let Some(e) = error.downcast_ref::<std::io::Error>() {
               if e.kind() == std::io::ErrorKind::ConnectionAborted {
                   // TODO: handle an aborted connection
               }
          }
     }
     _ => return Err(error),
}

As you can see from this example, control flow is still very much possible, but does now rely on runtime downcasting for control flow of errors that we generally expect users to not have to handle. For example, in this case, the user wants to handle ConnectionAborted but we expect this to not be a super common use case as the SDK already does automatic retries and a policy will likely be there to handle offline scenarios in the general case.

NOTE: the rest of this section is no longer true and has been removed from the design.

The biggest difference with the current design is that the new Error is actually generic over some type that represents operation specific errors. For example, if we're dealing with an error in the Cosmos create_database operation, the error type would be Error<CreateDatabaseError>. The CreateDatabaseError would include all the documented cases specific to CreateDatabase - for example, a bad request or a conflict. These errors are the ones we generally expect users to want to handle, and they are handled in a first class way:

 if let ErrorKind::Operation(CreateDatabaseError::Conflict) = error.kind() {
     /// TODO: handle conflict
}

Error messages

The new Error type comes with an improved ability to provide good error messages. The Error type conforms to best practices of the std::error::Error including providing a source for each error and displaying errors in such a way that plays nicely with ecosystem crates like eyre for displaying good error messages.

Included are conveniences like extensions to the Result type which allow for providing additional context to the error:

HeaderValue::from_str(&self.header).with_context(
    crate::error::ErrorKind::DataConversion,
    || {
        format!(
            "user agent '{}' cannot be serialized as an ascii-only HTTP header",
            self.header
        )
    },
)?;

Comparison

The new error type improves upon the old error type by not encouraging a growing tree of error types. It provides conveniences for providing good error messages while prioritizing a small error type that plays well with the ecosystem, provides good error messages, and provides control flow handling for cases where that handling is most likely needed.

heaths · 2022-01-24T17:56:09Z

From what crate is HttpError? Across our other language SDKs, we all have some sort of general HTTP error that includes the status code (integer) and status message i.e. description of the status code e.g. 400 is "Bad request" - typically from the standard Azure error response.

Ah, we define it and do have a generic StatusCode:

azure-sdk-for-rust/sdk/core/src/errors.rs

Lines 119 to 120 in ec80528

    
           #[error("HTTP error status (status: {:?}, body: {:?})", status, body)] 
        
           StatusCode { status: StatusCode, body: String },

Though, shouldn't the subsequent Utf8 error be more of a parse error? Sure, the service messed up, but the HTTP response itself would still likely indicate some sort of status like 200.

yoshuawuyts · 2022-01-25T14:11:34Z

From what crate is HttpError?

This is from the hyperium/http crate. This crate is used in the hyper and reqwest crates, and in order to wrap the errors in our Error type we need to write conversions to it.

Though, shouldn't the subsequent Utf8 error be more of a parse error? Sure, the service messed up, but the HTTP response itself would still likely indicate some sort of status like 200.

I'm not sure which Utf8 error this refers to, but in general the idea is that each resource defines its own set of error conditions it expects from the server, and for all other status-code related errors we assign them ErrorKind::UnexpectedOperation. Here are the error kinds we currently have:

/// The kind of error
///
/// The classification of error is intentionally fairly coarse.
#[derive(Copy, Clone, Debug, PartialEq)]
pub enum ErrorKind<O> {
    /// An error specific to a certain operation
    Operation(O),
    /// An HTTP status code that was not expected
    UnexpectedOperation { status: u16 },
    /// An error performing IO
    Io,
    /// An error converting data
    DataConversion,
    /// An error getting an API credential token
    Credential,
    /// A catch all for other kinds of errors
    Other,
}

Every operation is expected to return a Result<Data, azure_core::Error<OperationError>> where we encode the various known error cases into the OperationError type. This means that for example for cosmos::CreateDatabase which defines the following error codes, we encode these into an operation error like so:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
/// Expected errors which may be returned by the `CreateDatabase` operation.
pub enum CreateDatabaseError {
    /// Returned when the JSON body is invalid. Check for missing curly brackets or quotes.
    BadRequest,
    /// Returned when the ID provided for the new database has been taken by an existing database.
    Conflict,
}

If the server returns a statuscode which doesn't match any of the expected error kinds, we then assign it to an error with kind ErrorKind::UnexpectedOperation which contains the status code and message body.

Other error kinds such as ErrorKind::Io and ErrorKind::DataConversion are intended to be used to encode errors originating in our pipelines and backing transports.

Overall it makes sense that this is quite a bit to cover, and I'm hoping we can find some time in the sync meeting later today to walk through how the error handling scheme we propose in this PR works.

sdk/core/src/error/mod.rs

sdk/core/src/error/azure_core_errors.rs

sdk/core/src/error/mod.rs

nrc · 2022-01-28T13:40:41Z

sdk/core/src/error/mod.rs

+
+/// An extention to the `Result` type that easy allows creating `Error` values from exsiting errors
+///
+/// This trait should not be implemented on custom types and is meant for usage with `Result`


Could use the sealed trait pattern to enforce this

rylev · 2022-02-02T09:30:39Z

This PR is in a state where we could merge it. However, there are additional elements that I would like to follow up on - either in this PR or a subsequent one:

We currently heavily favor adding context to errors by adding string messages. It was pointed out by @nrc that this makes things like structured logging difficult. Instead it would be nice to explore how more structured data can be used to add context to errors.
Sealing the ResultExt trait so that others don't implement it
@yoshuawuyts and I have had subsequent chats about providing a "best effort" mechanism for structured UnsuccessfulResponse errors which gracefully fallback to what we're currently doing now (providing the raw HTTP status code and Azure error_code to the user). The team seemed happy with just providing the raw status code and error_code, but providing structured error conditions where possible does appear to be a better user experience. Since this is hard to change in the future, I'd like to make sure we've really examined all possibilities before settling on one.

rylev · 2022-02-21T15:28:09Z

Going to merge this. For the above three points, we decided:

more structured error reporting is a good thing, but we lack a clear vision for how exactly to do this. We'll continue exploring, but we don't want to delay merging this more.
I've made the ResultExt trait sealed
Talking this over with @JeffreyRichter, we confirmed that we do not want to make a best effort attempt at enumerating possible error conditions. The other language SDKs have tried this and it's resulted in worse user experiences.

heaths reviewed Jan 25, 2022

View reviewed changes

sdk/core/src/error/mod.rs Outdated Show resolved Hide resolved

rylev commented Jan 25, 2022

View reviewed changes

sdk/core/src/error/mod.rs Outdated Show resolved Hide resolved

cataggar added this to the azure_core 0.2.0 milestone Jan 26, 2022

cataggar added the Azure.Core The azure_core crate label Jan 26, 2022

heaths reviewed Jan 26, 2022

View reviewed changes

nrc reviewed Jan 28, 2022

View reviewed changes

rylev force-pushed the new-error branch from 6f430dd to 813be32 Compare January 28, 2022 17:34

ftt mentioned this pull request Feb 19, 2022

access to vault to have better error message #14

Closed

rylev added 18 commits February 21, 2022 16:28

Add new error type

8e45b29

Use error type in ListDatabases

014fe93

Use error type in ListDatabasesResponse

76b0d5f

Add some docs

3bcbb57

Treat Error type like wrapper when wrapping an error

d3e9bce

Add helper macros

e33d599

Make Error generic over operation specific errors:

ab4e2d6

Move CreateDatabase to the new error

06393c3

Move pipeline to use new error type

86d50e3

Add way to downcast to error types directly

27cf9d8

Add missing error type

4fe8ace

PR feedback

4350eb8

Get mock framework working

aa03076

Get generated services working

770a1f5

Introduce error code

3d7a89a

Fix broken mock test

3e612c8

Format code

aa1208e

Seal the ResultExt trait

36fb5f4

rylev force-pushed the new-error branch from dc7b164 to 36fb5f4 Compare February 21, 2022 15:30

yoshuawuyts approved these changes Feb 21, 2022

View reviewed changes

yoshuawuyts merged commit 58fcc4e into Azure:main Feb 21, 2022

rylev mentioned this pull request Mar 25, 2022

Remove azure_data_cosmos::Error #702

Merged

cataggar changed the title ~~[Proposal] Error Handling~~ Improved Error Handling Mar 30, 2022

cataggar mentioned this pull request May 27, 2022

migrate SDKs to azure_core::error::Error #771

Closed

12 tasks

cataggar mentioned this pull request Jun 6, 2022

Cosmos: Errors should be a concrete custom type #128

Closed

rylev deleted the new-error branch August 23, 2022 18:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved Error Handling #625

Improved Error Handling #625

rylev commented Jan 24, 2022 •

edited by thovoll

Loading

heaths commented Jan 24, 2022 •

edited

Loading

yoshuawuyts commented Jan 25, 2022

nrc Jan 28, 2022

rylev commented Feb 2, 2022

rylev commented Feb 21, 2022

Improved Error Handling #625

Improved Error Handling #625

Conversation

rylev commented Jan 24, 2022 • edited by thovoll Loading

The current approach

What do we need?

A proposed solution

Comparison

heaths commented Jan 24, 2022 • edited Loading

yoshuawuyts commented Jan 25, 2022

nrc Jan 28, 2022

Choose a reason for hiding this comment

rylev commented Feb 2, 2022

rylev commented Feb 21, 2022

rylev commented Jan 24, 2022 •

edited by thovoll

Loading

heaths commented Jan 24, 2022 •

edited

Loading